














DOCUMENT RESUME

ED 032 214    SC 007 464

Final Report to the National Science Foundation on Contract NSF-C414, Task I.

American Chemical Society, Columbus, Ohio. Chemical Abstracts Service.

Spons Agency: National Science Foundation, Washington, D.C.

Pub Date: Mar 69
Note: 194p.

EDRS Price: MF-$0.75 HC-$9.80

Descriptors: *Chemistry, Information Dissemination, *Information Processing, *Information Science, Systems Development

Identifiers: Chemical Abstracts Service, Chemical Registry System

Chemical Abstracts Service developed and expanded a computer-based Chemical
Registry System and operated it on a large-scale pilot basis. Some 1,744,319
registry transactions were made, resulting in the addition of 988,806 unique
substances to the Registry Files. Continual effort has been made to improve computer
capabilities and procedures, add new types of compounds, and improve user services
and programs. In addition to registration, this included (1) the testing of alternative
input methods for chemical information, including the development of special
keyboards and keyboarding conventions; (2) the installation of several operational
adjustments in the System; and (3) the development of computer support systems.
Tabular data are included to show various operations and characteristics of the
System. (RR)








ED 032 214












FINAL REPORT 



to the 



NATIONAL SCIENCE FOUNDATION 



on 



CONTRACT NSF-C414, TASK I 



CHEMICAL ABSTRACTS SERVICE 







AMERICAN CHEMICAL SOCIETY 

U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
OFFICE OF EDUCATION 

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE 
PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS 
STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION 
POSITION OR POLICY. 



Columbus, Ohio 



March 1969 



























FINAL REPORT 
to the 

NATIONAL SCIENCE FOUNDATION 
on 

CONTRACT NSF-C414, TASK I



CHEMICAL ABSTRACTS SERVICE 
AMERICAN CHEMICAL SOCIETY 



"PERMISSION TO REPRODUCE THIS 
COPYRIGHTED MATERIAL HAS BEEN GRANTED 



BY. 



Fred A. Tate 



Chemical Abstracts 

TO ERIC AND ORGANIZATIONS OPERATING 
UNDER AGREEMENTS WITH THE U.S. OFFICE OF 
EDUCATION. FURTHER REPRODUCTION OUTSIDE 
THE ERIC SYSTEM REQUIRES PERMISSION OF
THE COPYRIGHT OWNER." 






Columbus, Ohio



March 1969 















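CONTENTS

                                                                  Page

ABSTRACT ........................................................... 1

I.    INTRODUCTION ................................................. 2

II.   OUTLINE OF REGISTRY OPERATIONS ............................... 5

III.  NUMBERS AND SOURCES OF REGISTERED SUBSTANCES ................ 10

IV.   REGISTRY INPUT METHODS ...................................... 16

      A. Input of Structural Information .......................... 16

      B. Input of Nonstructural Information ....................... 22

V.    IMPROVEMENTS AND EXTENSIONS IN THE REGISTRY ................. 27

      A. Extensions of Registry Algorithm ......................... 27

      B. Computer-Checked Temporary Identification Numbers ........ 27

      C. "Dot-Disconnected" Convention ............................ 28

      D. Extension of the Registry to New Classes of Compounds .... 28

      E. Improved Text Descriptor Processing ...................... 33

VI.   DESKTOP ANALYSIS TOOLS ...................................... 36

      A. Italicization ............................................ 37

      B. Capitalization ........................................... 38

      C. Checking of Punctuation Consistency ...................... 39

      D. Elimination of Invalid Characters ........................ 39

      E. Printing of Diagnostic Comments .......................... 39

      F. Nomenclature Sort Key Program ............................ 41

VII.  REDESIGN OF THE REGISTRY COMPUTER SYSTEMS AND
      REPROGRAMMING FOR THE IBM 360 COMPUTERS ..................... 46

      A. Reprogramming the Structure Registry ..................... 47

      B. Reprogramming of Nonstructural Systems ................... 48

VIII. GLOSSARY .................................................... 51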













APPENDIXES 



APPENDIX A 

An Overview of the CAS 
Chemical Registry System 

APPENDIX B 

The Chemical Compound Registry
APPENDIX C 

A Description of the 

CAS Chemical Registry System 

APPENDIX D 

The Generation of a Unique Machine Description 
for Chemical Structures — A Technique Developed 
at Chemical Abstracts Service 

APPENDIX E 

The Computer-Based Subject Index Support 
System at Chemical Abstracts Service 

APPENDIX F 

Improvements in Structure Registry Effected 
with the Redesign and Reprogramming for 
IBM 360 Computers 

APPENDIX G 

Systems for Registering and Naming Polymers at 
Chemical Abstracts Service 









LIST OF FIGURES

                                                                  Page

REGISTRY SYSTEM CHART A ............................................ 7

REGISTRY SYSTEM CHART B ............................................ 8

REGISTRY SYSTEM CHART C ............................................ 9

FIGURE 1: CAS Text-Typing Keyboard ................................ 18

FIGURE 2: CAS Structure-Typing Keyboard ........................... 21

FIGURE 3: Example of a Computer-Produced Data Sheet ............... 24

FIGURE 4: Examples of Computer-Produced Diagnostic Comments ....... 40

FIGURE 5: Illustration of Ordering on Compound Parent ............. 43

FIGURE 6: Illustration of Ordering Ignoring Prefixes .............. 44

FIGURE 7: Illustration of Ordering by Numeric
          Value When Alphabetics Are Identical .................... 45

FIGURE 8: Typed Structure for the B Chain of Bovine Insulin ....... 49













LIST OF TABLES

                                                                  Page

TABLE I   - Registry Status Summary Report ........................ 12

TABLE II  - Total Number of Registrations Performed
            through 16 March 1969 ................................. 13

TABLE III - Summary of Registration ............................... 14

TABLE IV  - First Source of Names on Nomenclature File ............ 15



ABSTRACT

Under Contract NSF-C414 (1 June 1965 - 16 March 1969), Chemical
Abstracts Service has developed and expanded a computer-based Chemical
Registry System and operated it on a large-scale pilot basis. As a
result of Task I, some 1,744,319 registry transactions were made, resulting
in the addition of 988,806 unique substances to the Registry Files.
Together with registrations from other sources, this brought the total
Registry File to 1,079,551 unique substances.

Following the initial development of the Registry System, continual
effort has been made to improve computer capabilities and procedures, add
new types of compounds, and improve user services and programs. In
addition to registration, this includes (1) the testing of alternative input
methods for chemical information, including the development of special
keyboards and keyboarding conventions that reduce the effort required to
generate data in machine language for input to the computer; (2) the
installation of several operational adjustments in the System that increase
overall efficiency and broaden the range of compounds that are machine
registered; and (3) the development of computer support systems that
increase the productive efficiency of the chemical and clerical staff of
the Chemical Registry System.

This Final Report describes the work performed under this contract
and indicates the present status and operation of the Chemical Registry
System. In addition, tabular data are included to show various operations
and characteristics of the System.

I. INTRODUCTION



Between 1 June 1965 and 16 March 1969, Chemical Abstracts Service, a
Division of the American Chemical Society, contracted with the National
Science Foundation to undertake the experimental development and pilot
operation of a computer-based Chemical Registry System. The function of
this information-handling system is to organize and file information about
chemical substances, identifying each substance on the basis of its
conventional two-dimensional structural diagram and resolving different
versions of the same diagram so as to uniquely identify each substance. The
Registry System files structural data, molecular formulas, names, and
bibliographic citations in a set of interrelated manual and machine files,
making possible the quick retrieval of various items of data about the
substances filed.

The development of the Chemical Registry System was undertaken as a
key step in building an operational, integrated, man-machine system for
manipulating information about chemical substances, a system that would
be capable of high speed, flexibility, and depth in responding to the
information needs of those who use chemical information. The system of
registration provides a basis for organizing substance-oriented data selected
from the full range of the scientific and technical literature. The Registry
System was established on a computer basis to assure maximum consistency,
efficiency, timeliness, and responsiveness of operation.

The value of a single, unified computer-based repository is readily
apparent when it is noted that an estimated [figure illegible] of all
published chemical literature relates in some way to chemical compounds or
mixtures. Already, with just over four years of operation, the CAS Chemical
Registry System includes information about more than one million substances,
and eventually data on the estimated three million or more compounds
reported in CA and Beilstein are expected to be incorporated into the
System. Such a large-scale system, based on computer handling, allows
correlations to be made between the items of data in the file which would
not be practical with manual information resources. In addition, the data
within the System can be retrieved in a variety of ways and manipulated to
make many types of subject-oriented, index-type listings available with a
minimum of human intervention.

The Registry System operates by assigning a unique number, called a
Registry Number, to each compound when it is first entered into the file.
Whenever a compound which is already on file appears in a new reference,
the previously assigned number is automatically recovered for use in filing
the new reference. The System has three principal files of data: the
atom-bond connection records of the chemical structure, the various forms of
nomenclature associated with each substance, and bibliographic
identification of the sources from which each compound was registered. A
description of the Registry System's scope and potential use is provided in
Appendixes A and B.

The objectives of Task I of Contract NSF-C414 have been (1) building
up the Registry Files by registering the compounds from a variety of
sources, including CAS internal reference files and the current literature
as reported in CA; (2) accumulating performance data under large-scale
pilot-plant conditions; (3) testing various methods of input to the system;
(4) broadening the range of compounds that can be handled in the machine
system; and (5) developing and producing analysis tools and computer
support systems with which the chemical and clerical staff of the Chemical
Registry System can operate more efficiently.

II. OUTLINE OF REGISTRY OPERATIONS



Registry Charts A, B, and C (pages 7, 8, and 9) are included here
to summarize the flow of data in the CAS Chemical Registry System for
compounds registered as a result of processing information for Chemical
Abstracts (CA) indexes. Appendix C describes Registry workflow in greater
detail.

Registration is an identification procedure for chemical substances.
By definition, registration is the process of determining whether or not
a substance is in the Registry Files and of establishing a unique numerical
label ("Registry Number") for each different registered substance. Thus,
the Registry System depends upon the algorithmic manipulation of
information in a conventional structural diagram to create one, and only one,
machine record for each substance. This record must always be the same
for each different substance; it cannot vary with the size or orientation
of the structural diagram, with arbitrary numbering conventions, or
with varying chemical nomenclature. The algorithm used in the CAS Registry
System is described in Appendix D, and has been mathematically proven to
accomplish the unique identification of each substance. For each unique
substance, a Registry Number is assigned as an identifying tag. This
number is serially assigned by a computer algorithm; the numbers have no
established pattern relating to the structural configuration of the
compounds registered. Each Registry Number is a nine-digit number, the last
digit being a machine-computed check digit for use in reducing the number
of transcription errors.
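
The lookup-or-assign behaviour described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CAS algorithm of Appendix D: the canonical form used here is only a normalized tuple of atoms and bonds (it assumes the atoms already carry a canonical numbering), and the check-digit rule is the position-weighted sum used in present-day CAS Registry Numbers, since the report does not state the formula used in 1969. The names `Registry`, `canonical_key`, and `check_digit` are hypothetical.

```python
from itertools import count


def check_digit(serial: int) -> int:
    """Weighted digit sum mod 10 (assumed rule; rightmost digit has weight 1)."""
    digits = [int(d) for d in reversed(str(serial))]
    return sum(weight * d for weight, d in enumerate(digits, start=1)) % 10


def canonical_key(atoms, bonds):
    """Stand-in for the unique machine description of Appendix D.

    Assumes atoms are already canonically numbered, so only the order and
    direction in which bonds were written down need to be normalized.
    """
    norm_bonds = sorted((min(a, b), max(a, b), kind) for a, b, kind in bonds)
    return (tuple(atoms), tuple(norm_bonds))


class Registry:
    def __init__(self):
        self._by_structure = {}      # canonical key -> Registry Number
        self._serial = count(1)      # serial numbers carry no structural meaning

    def register(self, atoms, bonds):
        """Return (registry_number, is_new) for a structure."""
        key = canonical_key(atoms, bonds)
        if key in self._by_structure:
            return self._by_structure[key], False
        serial = next(self._serial)
        number = serial * 10 + check_digit(serial)   # check digit goes last
        self._by_structure[key] = number
        return number, True


registry = Registry()
ethanol = (["C", "C", "O"], [(1, 2, "single"), (2, 3, "single")])
same_drawn_differently = (["C", "C", "O"], [(3, 2, "single"), (2, 1, "single")])
first = registry.register(*ethanol)
second = registry.register(*same_drawn_differently)
```

The second call recovers the number assigned by the first, which is the property the text requires: one machine record per substance regardless of how the diagram was written down.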
The Registry Number functions as a machine address within the associated
files, acting as a tie between the information related to a given compound in
each of the System's files. For instance, it ties nomenclature and
bibliographic information to the structural representation in the connection
table file. It can also serve this function in files of conceptual information
such as computer files of biochemical and physical property data, which may
be developed by users of the System.
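
A small sketch of the Registry Number acting as the tie between files follows. The three dictionaries are hypothetical in-memory stand-ins for the structure, nomenclature, and bibliography files; the Registry Number, names, and references are placeholders invented for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three principal files, each keyed by
# Registry Number.
structure_file = {}                      # number -> connection-table record
nomenclature_file = defaultdict(set)     # number -> names
bibliography_file = defaultdict(list)    # number -> references


def add_reference(number, name, citation):
    """File a name and a reference under an existing Registry Number."""
    nomenclature_file[number].add(name)
    bibliography_file[number].append(citation)


def lookup(number):
    """Collect everything the files hold for one Registry Number."""
    return {
        "structure": structure_file.get(number),
        "names": sorted(nomenclature_file[number]),
        "references": list(bibliography_file[number]),
    }


structure_file[11] = "C-C-O"                        # placeholder record
add_reference(11, "ethanol", "reference A")
add_reference(11, "ethyl alcohol", "reference B")   # synonym, same number
record = lookup(11)
```

Because both names are filed under the same number, the two references end up together rather than at unrelated points under synonymous names.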

The Registry Number also has other uses such as providing a standardized, 
concise form of identification for organizing collections of substance- 
oriented data. Since the System provides unique identification for compounds, 
it eliminates the likelihood of filing information at two or more unrelated 
points under unrecognized synonymous names. 

For certain types of substances — a small minority — structural 
definition is incomplete or conventions have not yet been established for 
registration. Such substances can be registered manually by a chemist work- 
ing with a small set of files. The nonstructural information about manually 
registered compounds is added to the computer files and thus is available 
for use along with information on computer-registered substances. 
REGISTRY SYSTEM CHART A
Data Flow in the CAS Compound Registry System

REGISTRY SYSTEM CHART C
Flow Sheet for Compounds Selected from Synthetic Sections of Chemical Abstracts

[Chart note, partly illegible:] In this material, names of the compounds are
often not included in the documents being indexed. Thus, it is usually more
efficient to process the material through the structure-drawing operation and
then through the subject indexing

III. NUMBERS AND SOURCES OF REGISTERED SUBSTANCES



One of the main objectives of Task I was the buildup of useful files
of chemical information through the registration of substances identified
during the indexing process for CA, substances contained in several CAS
internal reference files, and substances identified in specified published
reference works such as the Colour Index, the Merck Index, and United States
Adopted Names.

Registration of substances identified in indexing CA began with CA
Volume 62 (January - June 1965) and continued with subsequent volumes
through Volume 69 (July - December 1968), processing for which was in
progress as of 16 March 1969, when Contract NSF-C414 ended.*

Initial registration of the specified CAS Reference Files and published
sources was undertaken during the first two years of Contract NSF-C414.
Most of the reference files are routinely updated as pertinent new
information is added to them. One of the files, the CAS Silicon File, was
registered as a special project between July 1968 and January 1969. This file
contains organic silicon compounds referenced in CA Indexes from Volume 1
through the present and in Beilsteins Handbuch der Organischen Chemie.

In addition to these Task I sources, the Registry System includes
40,375 compounds registered by CAS prior to the start of Contract NSF-C414,
and 82,505 substances added to the files as a result of other projects.

Tables I through IV summarize the activity and status of the Registry
as of 16 March 1969. Table I presents an overall summary of the size of
the files. Table II gives the total number of registrations performed,
while Table III shows the number of these registrations that have
resulted in the addition of new compounds to the files. (This table also
breaks down the number of compounds into machine and manual registrations.)
Table IV shows the first source of names on the Nomenclature File. That
is, for each source listed, the table shows how many new names that source
contributed to the file.

* Registration will be completed under Contract NSF-C853.
TABLE I

REGISTRY STATUS SUMMARY REPORT

1. SIZE OF FILES
   a. Number of Different Compounds                1,074,319
   b. Number of Mixtures                               5,232
   c. Number of Different Names on File            1,[?]20,235
   d. Estimated Number of References               2,181,600

2. NUMBER OF REGISTRATIONS (MACHINE AND MANUAL)
   a. Substances New to File                       1,079,551
   b. Substances Matching One on File              1,102,091
      TOTAL                                        2,181,642

3. SOURCES OF REGISTRATION
   a. CA Indexes                                   1,529,139
   b. Task I Files                                   215,180
   c. Task III                                        38,329
   d. Other                                          398,994
      TOTAL                                        2,181,642
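
The figures in Table I can be cross-checked against one another and against the Task I total quoted in the Abstract. The short script below simply restates the table values and the arithmetic; the variable names are illustrative.

```python
# Values transcribed from Table I.
compounds, mixtures = 1_074_319, 5_232
new_to_file, matching_one_on_file = 1_079_551, 1_102_091
by_source = {
    "CA Indexes": 1_529_139,
    "Task I Files": 215_180,
    "Task III": 38_329,
    "Other": 398_994,
}

# Compounds plus mixtures give the substances new to the file.
substances_on_file = compounds + mixtures

# New and matching registrations give the total number of registrations,
# which must equal the total over all sources.
total_registrations = new_to_file + matching_one_on_file
total_by_source = sum(by_source.values())

# CA Indexes plus Task I Files give the Task I transactions of the Abstract.
task_i_transactions = by_source["CA Indexes"] + by_source["Task I Files"]
```

All four relations hold: 1,079,551 substances, 2,181,642 registrations by either route, and 1,744,319 Task I transactions.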
TABLE II

Total Number of Registrations Performed
through 16 March 1969

Source                                          Total Registrations

A - Current Literature
      CA Volume 62                                   168,859
      CA Volume 63                                   203,895
      CA Volume 64                                   210,563
      CA Volume 65                                   218,041
      CA Volume 66                                   204,397
      CA Volume 67                                   210,575
      CA Volume 68                                   211,421
      CA Volume 69                                   101,388
      Subtotal                                                   1,529,139

B - CAS Reference Files
      Alkaloid File                                    4,318
      Colour Index                                    15,019
      Drug File                                        1,622
      Fluorine File                                   59,814
      CA Formula Index Cross-References                3,260
      Lange Handbook of Chemistry                      9,722
      Merck Index                                     21,266
      Pesticide Index                                    932
      CAS Reference File                                 439
      Ring Index (plus supplements)                   30,204
      Silicon File                                    15,141
      SOCMA Handbook                                   6,768
      CA Specific Volume Cross-References             10,713
      Steroid File                                     1,540
      CAS Subject Index Cross-References              31,073
      Terpene File                                     1,829
      USAN (United States Adopted Names)               1,520
      Subtotal                                                     215,180

C - Other Registration (Not Task I)                                437,323

D - Total*                                                       2,181,642

* Includes 40,375 compounds registered before 1 June 1965.
TABLE III

SUMMARY OF REGISTRATION

                                           Unique Substances Registered
Source                                     By Machine   Manually      Total

A - Current Literature
      CA Volume 62                           115,177      6,318
      CA Volume 63                           131,958      3,262
      CA Volume 64                           120,700      3,413
      CA Volume 65                           106,553      7,634
      CA Volume 66                            99,929      6,253
      CA Volume 67                            99,530      4,162
      CA Volume 68                            95,709      4,554
      CA Volume 69                            54,798        234
      Subtotal                                                      860,184

B - CAS Reference Files
      Alkaloid File                              747      1,403
      Colour Index                             3,270      4,190
      Drug File                                  335          2
      Fluorine File                           42,255          3
      CA Formula Index Cross-References          952         20
      Lange Handbook of Chemistry              4,205        565
      Merck Index                              6,027      2,456
      Pesticide Index                            154         16
      CAS Reference File                         175         29
      Ring Index (plus supplements)           23,473          -
      Silicon File                            11,843        148
      SOCMA Handbook                           6,480        140
      CA Specific Volume Cross-References      1,725        136
      Steroid File                               348        311
      CAS Subject Index Cross-References      16,105         53
      Terpene File                               517        372
      USAN (United States Adopted Names)          89         78
      Subtotal                                                      128,622

C - Other Registration (Not Task I)           82,505      8,240      90,745

D - Total*                                                        1,079,551

* Includes all registration prior to 1 June 1965.
TABLE IV

FIRST SOURCE OF NAMES ON NOMENCLATURE FILE

                                                Names on File
Source                                          17 March 1969

A - Current Literature
      CA Volume 62                                   171,431
      CA Volume 63                                   179,463
      CA Volume 64                                   219,623
      CA Volume 65                                   211,516
      CA Volume 66                                   184,653
      CA Volume 67                                   180,257
      CA Volume 68                                   209,454*
      CA Volume 69                                   110,975*

B - CAS Reference Files
      Alkaloid File                                    3,161
      Colour Index                                    39,092
      Drug File                                        2,559
      Fluorine File                                   51,672
      CA Formula Index Cross-References                3,409
      Lange Handbook of Chemistry                     11,898
      Merck Index                                     29,980
      Pesticide Index                                  3,388
      CAS Reference File                               3,293
      Ring Index (plus supplements)                   21,512
      Silicon File                                    15,591
      SOCMA Handbook                                  27,292
      CA Specific Volume Cross-References             18,220
      Steroid File                                     2,682
      CAS Subject Index Cross-References              62,827
      Terpene File                                     3,202
      USAN (United States Adopted Names)               2,518

C - Other Registration (Not Task I)                  132,631

* Processing incomplete as of 17 March 1969.
IV. REGISTRY INPUT METHODS

The Chemical Registry System requires the routine translation into
machine language of structural information, names, references, and control
information. In this large-scale system, it is of great economic importance
to develop efficient procedures for creating this data in machine-readable
form while at the same time assuring the reliability of the information that
enters the files.

During the early stages of Contract C414, CAS experimented with several
methods of recording structural and nonstructural information in
machine-readable form. These tests involved equipment evaluation, the
establishment of keyboarding conventions, and the continued refinement of
data flow paths.

The following sections describe the various input methods that have
been tried, as well as some improved procedures which have been
instituted to save processing time.



A. Input of Structural Information

1. Manual Generation of Connection Tables Followed by Keypunching.

The first input method used for the registration of structures in
the system was the manual clerical generation of connection tables followed
by the keypunching of connection-table data into cards. In this process,
each nonhydrogen atom in the structural diagram is numbered by a clerk, and
then a connection table is written which lists each atom by number, indicating
the atom to which each is bonded and the types of connecting bonds.
Finally, each rank of the table is keypunched for input to the computer.

This manual procedure was used during the first nine months (June
1965 through February 1966) of the contract.
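
To make the idea of a connection table concrete, the sketch below writes one rank per nonhydrogen atom, each naming the earlier atom it is bonded to and the bond type. The field layout, the `Rank` and `validate` names, and the bond labels are assumptions made for illustration; the report does not give the card format, and ring-closure bonds, which need additional entries, are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rank:
    number: int                  # atom number assigned by the clerk
    element: str
    attached_to: Optional[int]   # lower-numbered atom this atom is bonded to
    bond: Optional[str]          # type of that bond


# Acetic acid, CH3-C(=O)-OH, with hydrogens omitted.
acetic_acid = [
    Rank(1, "C", None, None),
    Rank(2, "C", 1, "single"),
    Rank(3, "O", 2, "double"),
    Rank(4, "O", 2, "single"),
]


def validate(table):
    """Return the problems a simple input check might flag."""
    problems = []
    for position, rank in enumerate(table, start=1):
        if rank.number != position:
            problems.append(f"rank {position}: atoms must be numbered consecutively")
        if position == 1:
            if rank.attached_to is not None:
                problems.append("rank 1: the first atom has no attachment")
        elif rank.attached_to is None or not 1 <= rank.attached_to < position:
            problems.append(f"rank {position}: attachment must be an earlier atom")
    return problems


bad_table = [Rank(1, "C", None, None), Rank(2, "O", 3, "single")]
```

A table such as `bad_table`, whose second rank points at an atom that does not exist yet, is the kind of keying error a rank-by-rank check can catch before the table enters registration.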

2. Direct Keypunching of Connection Tables.

In October 1965, CAS began testing the direct keypunching
of connection tables. In this modification, the manual generation of
connection tables is eliminated and they are keyboarded directly from the
structure, which the clerk numbers before she keypunches the information.

This method has an obvious advantage over the first method in the
elimination of a processing step. However, tests were conducted to determine
whether this advantage was outweighed by decreased output and increased
errors in the connection tables. The experiments showed that direct
keyboarding of connection tables created a net decrease in the cost of input.

3. Direct Typing of Connection Tables.

In September 1965, CAS began testing the Dura Mach 10
paper-tape-punching typewriter as a direct keyboarding device for connection
tables. This typewriter produces both hard copy and a punched paper tape in
which each typed character is coded in machine language. In using the Dura
typewriter to generate connection tables, each line of data is typed on the
worksheet appended to the bottom of the Registry form. To ensure the most
efficient operation, CAS designed a special keyboard for this typewriter.
Figure 1 illustrates this keyboard.
Figure 1: CAS Text-Typing Keyboard. Special characters are in the uppercase
positions as shown and in both positions on keys 0 and 40-43. Where no
uppercase symbol is shown, the conventional capital letter is available. Such
capitals are used as special "flags" in most text-typing operations.
Uppercase is indicated by the double dagger (‡) (key 40) preceding the
lowercase letter.

The use of paper-tape-punching typewriters requires the conversion of
data from punched paper tape to magnetic computer tape. Initially, the
only equipment available to accomplish this conversion required two
conversion steps: first, conversion from paper tape to punched IBM cards
using a Dura Converter, then the transfer of data from cards to magnetic
tape using the computer.

Rather than use this inefficient method, CAS rented an early
production model of the Digi-Data paper-tape to magnetic-tape converter,
which accomplished conversion in a single step.

After some initial difficulties with the hardware, CAS found this
converter to be satisfactory. Moreover, conversion was accomplished at
about 37% of the cost of the previous method. However, since the average
net cost of input by this method was found to be higher than that incurred
with the Mohawk 1101 Data Recorder (see below), its use was suspended in
February 1966.

4. Mohawk 1101 Data Recorder.

Since the beginning of March 1966, CAS has been using the Mohawk
1101 Magnetic Tape Keyed Data Recorder for input of connection tables.
The Mohawk 1101 is a type of keypunch which records directly onto magnetic
tape, eliminating the need for punched cards or paper tape. Although its
operation is almost identical to the keypunch, the Mohawk is slightly faster
for certain operations (e.g., skipping or duplicating) because the tape
transport speed is faster than card transport speed. There is no method,
except computer printout, for producing hard copy on the Mohawk 1101, but
this has caused no problems in the direct generation of connection tables.
5. Improved Error-Correction Technique.

Early in 1966, CAS instituted an improved, computer-supported
error-correction technique for connection tables. In order to avoid
rekeyboarding the entire table when one was rejected for error, CAS developed
a pending routine that "saved" the rejected table and printed it for the
clerk. Upon identifying the error, the clerk then entered only the ranks
of the table that were in error, not the entire table. The corrected
information, merged with the "saved" table, then re-entered the input cycle.
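
The merge step of this technique can be sketched as follows. The representation of a table as a mapping from rank number to a (element, attached-to, bond) tuple and the function name `merge_corrections` are assumptions made for the example; the report does not describe the record layout of the pending routine.

```python
def merge_corrections(saved_table, corrected_ranks):
    """Replace only the re-keyed ranks of a rejected ("saved") table.

    Both arguments map rank number -> rank contents. A correction for a
    rank that is not in the saved table is treated as a keying error.
    """
    unknown = set(corrected_ranks) - set(saved_table)
    if unknown:
        raise KeyError(f"corrections for ranks not in saved table: {sorted(unknown)}")
    merged = dict(saved_table)
    merged.update(corrected_ranks)
    return [merged[number] for number in sorted(merged)]


# Rank 3 was rejected: it points at atom 7, which does not exist.
saved = {
    1: ("C", None, None),
    2: ("C", 1, "single"),
    3: ("O", 7, "double"),
}
fixed = merge_corrections(saved, {3: ("O", 2, "double")})
```

Only one rank is keyed again; the other two are carried over from the saved table unchanged.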

6. The Structure Typewriter.

In April 1967, a system of direct structure typing was instituted.
A Mohawk 1101 Data Recorder connected to a specially modified typewriter
mechanism* permits a clerk to copy hand-drawn structures, producing hard
copy and computer-usable magnetic tape simultaneously. A companion computer
program then converts the typed structure to a connection table for
processing. The advantages of this method lie in the elimination of all
clerical conversion steps required for the generation of the connection
table, in the reduction of the average number of keystrokes per structure,
and in the reduction of errors because the clerk is merely copying the
structure rather than translating it into another format.

This method has been found to be highly economical in both cost and
time. Currently, approximately 90% of all structures are input by
this system. The remaining structures are generally complex compounds which
are not readily adaptable to typing on the typewriter. These structures are
registered by means of connection tables, as described in point 4 above.

Figure 2 shows the layout for this typewriter keyboard.

* See Mullen, J. M., "Atom-by-Atom Typewriter Input for Computerized Storage
and Retrieval of Chemical Structures." J. Chem. Doc. 7, p. 88-93 (1967).
7. The Use of Input Shortcuts for Structural Information.

The routine processing of structural diagrams for registration
involves the repeated handling of many structure fragments that are common
to several compounds. CAS has developed input shortcuts to reduce the
processing time and the chance of error in handling such fragments by a
system whereby a group of atoms can be handled as if it were a single atom.
One such symbol, Ph for phenyl, was used at the start of the contract.
Since then, additional shortcuts have been developed. The computer program
recognizes each symbol and "expands" the record to the full set of atoms
and bonds represented. Thus, the same computer record results whether or
not the symbol is used.
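
The expansion of a shortcut symbol can be illustrated with Ph. In the sketch below, a shorthand record is a list of (symbol, attached-to entry, bond) triples, and `expand` turns it into full atom and bond lists. The record format, the "aromatic" bond label, and the rule that a Ph group is attached through its first ring carbon are all assumptions made for the example, not details taken from the report.

```python
def expand(record):
    """Expand Ph symbols in a shorthand record into atoms and bonds.

    record: list of (symbol, attached_to_entry, bond), entries numbered
    from 1 in input order. Returns (atoms, bonds) with atoms renumbered
    consecutively and bonds as sorted (atom, atom, type) triples.
    """
    atoms, bonds = [], []
    first_atom = {}   # entry index -> number of the entry's first atom
    for index, (symbol, attached_to, bond) in enumerate(record, start=1):
        start = len(atoms) + 1
        first_atom[index] = start
        if symbol == "Ph":
            ring = list(range(start, start + 6))
            atoms += ["C"] * 6
            bonds += [(a, b, "aromatic") for a, b in zip(ring, ring[1:])]
            bonds.append((ring[0], ring[-1], "aromatic"))   # ring closure
        else:
            atoms.append(symbol)
        if attached_to is not None:
            bonds.append((first_atom[attached_to], start, bond))
    return atoms, sorted(bonds)


# Toluene keyed with the shortcut: a methyl carbon carrying a phenyl group.
toluene_short = [("C", None, None), ("Ph", 1, "single")]

# The same structure written out atom by atom.
longhand_atoms = ["C"] * 7
longhand_bonds = sorted(
    [(1, 2, "single"), (2, 7, "aromatic")]
    + [(n, n + 1, "aromatic") for n in range(2, 7)]
)
```

Expanding `toluene_short` yields exactly `longhand_atoms` and `longhand_bonds`, mirroring the statement that the same computer record results whether or not the symbol is used.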

B. Input of Nonstructural Information

There are two basic computer-based systems for entering nonstructural
data into the computer files: the Name Matching System and the Data
Sheet System.

1. Name Matching Capability.

With the large numbers of registered compounds on file, it is
sometimes more efficient to update the files by retrieving the Registry
Number through a compound's name rather than its structure. The name of
the previously registered compound can be matched against the names
already on file, the Registry Number retrieved, and the new bibliographic
information on the compound added to the file via the retrieved
Registry Number.

Manual name matching was used during the first nine months of the
contract by chemists employing the Desktop Analysis Tools. However, this
method became inefficient as the files grew larger. Therefore, during the
first quarter of 1966, CAS developed and began using a computer-based
name-matching capability by which names of compounds being processed for
registration may be compared with already registered compounds. The chemist
must still be involved in the process to recognize ambiguous names. When an
exact match is achieved, the Registry Number, the molecular formula, and the
CA index name (if available) can be retrieved from the Nomenclature and
Bibliography Files.

In routine operations, the computer-based Name Match System is used
for those segments of the input which experience has shown to contain a
relatively large percentage of compounds with common trivial names. These
segments include the Biochemical, Industrial, and Physical Sections of CA
(see Registry System Chart B, page 8).
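
A name-matching lookup of the kind described can be sketched as a dictionary from names to Registry Numbers. The `NameIndex` class, the case and spacing normalization, and the sample names and numbers are assumptions made for illustration; the report says only that an exact match retrieves the Registry Number and that ambiguous names are referred to a chemist.

```python
def normalize(name):
    """Fold case and collapse spacing before comparison (an assumed rule)."""
    return " ".join(name.lower().split())


class NameIndex:
    def __init__(self):
        self._numbers = {}   # normalized name -> set of Registry Numbers

    def add(self, name, number):
        self._numbers.setdefault(normalize(name), set()).add(number)

    def match(self, name):
        """Return ('exact', number), ('ambiguous', numbers) or ('none', None)."""
        hits = self._numbers.get(normalize(name), set())
        if len(hits) == 1:
            return "exact", next(iter(hits))
        if hits:
            return "ambiguous", sorted(hits)   # left for the chemist to resolve
        return "none", None


index = NameIndex()
index.add("Aspirin", 101)                # numbers are placeholders
index.add("acetylsalicylic acid", 101)
index.add("Xylene", 202)                 # one trivial name, two substances
index.add("xylene", 203)
```

Here `index.match("aspirin")` recovers a single number for use in filing a new reference, `index.match("XYLENE")` is flagged as ambiguous, and a name not on file falls through to structure registration.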

2. The Data Sheet System.

In late 1965, CAS began developing the Data Sheet System,
so called from the special work sheets which are printed by the computer to
facilitate chemists' review of data being input to the files. (See Figure 3.)
The Data Sheet System is the fundamental tool whereby nonstructural
information (primarily names and references) is developed in machine language,
edited, corrected where necessary, manipulated by the computer, and finally
output for updating the Registry Files and for use in CA indexes.

The first requirement for this system was the installation of equipment
with which clerks could, in a single keyboarding, produce data in both
hard-copy and machine-language forms. The standard electric typewriters that
had been used to generate CA index cards were replaced first with
paper-tape-punching electric typewriters, then with Mohawk Data Recorders.
This change eliminated a separate keypunching operation that had
previously been used to generate names in machine language, and it allowed the
CA index names to be captured on tape as a by-product of the index
transcription process.

Figure 3. Example of a Computer-Produced Data Sheet. The top line of
information shows the sequential number of the Data Sheet, the type of
worksheet ("Data Changed" in the example; other possible types include "New
Worksheet" and "Entry Killed"), and the date the Data Sheet was produced. The
second line of information gives the CA identification, including the volume
and section numbers and the start and end of the batch of abstracts being
processed; the indexer's initials; the typist's initials; and the Julian-form
date of original typing. The third line gives the CA reference for the data
that follow.

In the following data, the left-hand column contains single-letter codes
that indicate the beginning of each new field of data: the F field for the
molecular formula (MF); the R field for the Preferred CA Index Name (PIN); the
N field for the Added Index Name (AIN); and the C field for control
information. Each of the name fields may contain the name heading (H); the
name modification (M); stereochemical identification (S) for the name (not
illustrated); the text modification (TM) used in the CA Subject Index; and the
corner note (CN), a coded production aid used in index generation. The control
information field includes the CA identification (ID) and the TID number or
Registry Number (T/R) of the compound.



Beginning in January 1966, CAS began large-scale operation of the Data
Sheet System by installing new computer programs and changing the data flow
in the Compound Registry System to provide computer support for the review
of nonstructural information by chemists. (See Appendix E.) With this
extension, the Data Sheet System ensures that no information enters the
master Registry File until it has been reviewed by an appropriate staff
member and corrected where necessary. At the same time, the system prevents
redundant editing of names and references entering the computer file.

The role of the Data Sheet System in Registry operations is illustrated
in Registry System Flow Charts A, B, and C. In outline, the Data Sheet
System in its present form operates as follows: Information is typed on
tape-generating typewriters, thus simultaneously creating a hard copy for
review and a magnetic tape for input to the computer. The information
is processed by the computer into a work-pending file, and on call from a
CAS chemist, the data are printed on computer-generated data sheets. These
are proofread against the original hard copy as a check on reliability, and
the data are then reviewed for chemical accuracy. After the necessary
corrections have been made, keyboarded, and entered into the work-pending
file, the chemist approves the transfer of the reviewed data to the master
Registry Files. At the time of transfer, the computer "flags" the data as
having been edited; this flag serves the purpose of preventing unnecessary
re-review of the data at some later time.
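
The flow just outlined (work-pending file, correction, approved transfer,
"edited" flag) can be sketched as a small state holder. The class and method
names below are hypothetical and the storage is an in-memory dictionary; the
sketch only illustrates the order of operations, not the CAS programs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MasterRecord:
    data: str
    edited: bool = False  # set when a chemist-approved transfer occurs


class DataSheetFlow:
    """Toy model of the work-pending file and the master Registry File."""

    def __init__(self) -> None:
        self.work_pending: Dict[str, str] = {}
        self.master: Dict[str, MasterRecord] = {}

    def keyboard(self, key: str, data: str) -> None:
        """Typed input, or a keyboarded correction, enters the work-pending file."""
        self.work_pending[key] = data

    def print_data_sheet(self, key: str) -> str:
        """Produce the text a chemist would proofread and review."""
        return f"{key}: {self.work_pending[key]}"

    def approve(self, key: str) -> None:
        """Transfer reviewed data to the master file and flag it as edited."""
        self.master[key] = MasterRecord(self.work_pending.pop(key), edited=True)

    def needs_review(self, key: str) -> bool:
        """Data not yet flagged as edited still require review."""
        record = self.master.get(key)
        return record is None or not record.edited
```

Nothing reaches `master` except through `approve`, which is the property the
report emphasizes: unreviewed data never enter the master file, and flagged
data are not reviewed twice.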

The Data Sheet System also makes it possible to eliminate much repetitive
keyboarding because the system provides the typist with special codes
that signal the computer to repeat specifically identified information
which need not be retyped. Like the input shortcuts for structural
information, these short codes are expanded by the computer into their full
representation. Thus, failure to use the code will not affect the
appearance of data in the computer file. A brief example of such "dittoing"
is given below for the Id code, which reduces keyboarding for lists of CA
inverted systematic names. The Id code signals the repetition of that
portion of a CA index name which precedes a comma, followed by a space, e.g.,
the comma of inversion. This dittoing feature reduces keystrokes in the
typing of alphabetized lists of names, where one parent may have several
substituents grouped at one point on the list.

Name                                           Typed as

Galactopyranose, 2-acetamido-2-deoxy-          Galactopyranose, 2-acetamido-2-deoxy-
Galactopyranose, 3-amino-1,6-anhydro-3-deoxy-  Id3-amino-1,6-anhydro-3-deoxy-
Galactopyranose, 1,6-anhydro-                  Id1,6-anhydro-

Other dittoing codes enable the typist to signal the repetition of data
from one sub-field in a data sheet to another sub-field in the same sheet
or from one sub-field in a data sheet to the corresponding sub-field in the
following sheet(s).
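
The expansion of the Id code can be sketched as follows. The sketch assumes
that the code appears literally as the two characters "Id" at the start of a
typed entry and that the repeated portion runs through the first comma and
space of the preceding expanded name; the actual keyboard encoding is not
given in the report.

```python
from typing import List

DITTO = "Id"  # code as printed in the report; the real encoding is an assumption


def expand_dittos(typed_names: List[str]) -> List[str]:
    """Expand Id-coded entries into full inverted CA index names."""
    expanded: List[str] = []
    for typed in typed_names:
        if not typed.startswith(DITTO):
            expanded.append(typed)
            continue
        if not expanded:
            raise ValueError("ditto code with no preceding name")
        previous = expanded[-1]
        cut = previous.find(", ")  # the comma of inversion
        if cut == -1:
            raise ValueError(f"no comma of inversion in {previous!r}")
        # Repeat the parent (through the comma and space), then append the rest.
        expanded.append(previous[: cut + 2] + typed[len(DITTO):])
    return expanded
```

Applied to the three entries in the table, the function reproduces the
"Name" column. A production routine would also need a way to tell the code
apart from a genuine name that happens to begin with the same two letters,
which this sketch does not attempt.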

V. IMPROVEMENTS AND EXTENSIONS IN THE REGISTRY

Since the basic registration capability was established in 1965, CAS
has continually worked to extend the scope and efficiency of the system.
This section describes many of these improvements made after initial
operations of the Registry had begun in 1965. The extensions of the system
undertaken during the redesign and reprogramming of the System for the IBM
360 computers are described in Section VII of this report.



A. Extension of Registry Algorithm

Very early in the Contract work, the Registry Algorithm was modified
to detect patterns of symmetry in a structure and to utilize such symmetry
to prevent the unnecessary generation of equivalent connection tables as
"candidates" for the unique connection table. (Such candidate tables are
generated and compared with each other to permit the hierarchical sorting
and selection of the one unique table.) As a result of this modification,
the average time required for the computer generation of unique connection
tables was reduced about 20%.

B. Computer-Checked Temporary Identification Numbers

Temporary Identification (TID) numbers are used in the Registry System
to identify each structure at input until a Registry Number is assigned to
the substance or retrieved for it. With the large-scale input required
under Task I, CAS adopted computer-checkable TID numbers in June 1965.
These TID numbers, which are preprinted on Registry Sheets, consist of six
digits and a letter suffix. The digital portion is evenly divisible by
seven, thus providing machine checkability. At the time this TID number
format was adopted, computer routines to check the TID numbers were also
instituted.
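
The divisibility check is simple enough to state in a few lines. This is a
minimal sketch of the rule as described (six digits, one letter suffix,
digits divisible by seven); the assumption that the suffix is a single
upper-case letter, and the sample numbers, are illustrative only.

```python
import re

_TID_PATTERN = re.compile(r"(\d{6})([A-Z])")


def is_valid_tid(tid: str) -> bool:
    """Check the TID format and the divisible-by-seven property."""
    match = _TID_PATTERN.fullmatch(tid)
    if match is None:
        return False
    return int(match.group(1)) % 7 == 0
```

Because only one six-digit number in seven passes the test, most single-digit
keying errors in the numeric portion produce a number that fails it.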



C. "Dot-Disconnected" Convention

The "dot-disconnected" convention for structures and molecular formulas
of salts, clathrates, addition compounds, etc. was adopted during the first
six months of Task I. With this convention, a compound is represented as
the structure of the "parent" portion, followed by a dot and the structure
of the salt-forming moiety. Each moiety following the first (there may
be several moieties, each separated by a dot) carries a numerical prefix
to indicate its proportion to the first (parent) moiety. This coefficient
may be an "x" if the ratio is unknown. The molecular formulas of
"dot-disconnected" compounds are treated analogously to the structural
diagrams.

This treatment saves storage space in the Registry, allows the grouping
and automatic cross referencing of related compounds (for example, salts
with the same parent), and permits registration of those partially
determined structures in which the ratio between two fully defined moieties
is unknown.
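
For molecular formulas, the convention can be illustrated by splitting a
formula at the dots and reading the coefficient of each later moiety. The
sketch assumes a plain "." separator, whole-number coefficients, a
lower-case "x" for an unknown ratio, and an implied coefficient of 1 when no
prefix is written; fractional ratios are not handled.

```python
import re
from typing import List, Tuple, Union

Coefficient = Union[int, str]  # a whole-number ratio, or "x" when unknown


def split_dot_disconnected(formula: str) -> List[Tuple[Coefficient, str]]:
    """Return (coefficient, moiety) pairs; the parent always has coefficient 1."""
    parts = formula.split(".")
    result: List[Tuple[Coefficient, str]] = [(1, parts[0])]
    for part in parts[1:]:
        match = re.fullmatch(r"(\d+|x)?(.+)", part)
        if match is None:
            raise ValueError(f"empty moiety in {formula!r}")
        prefix, moiety = match.group(1), match.group(2)
        if prefix is None:
            coefficient: Coefficient = 1  # assumed default when no prefix is typed
        elif prefix == "x":
            coefficient = "x"
        else:
            coefficient = int(prefix)
        result.append((coefficient, moiety))
    return result
```

Taking the first element of the result gives the parent moiety, which is the
key on which related salts could be grouped or cross-referenced.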



D. Extension of the Registry to New Classes of Compounds

At the start of Task I, CAS registered only organic (i.e., carbon-containing)
compounds for which fully defined structural diagrams could be
drawn (excluding coordination compounds). Since then, structuring
conventions and computer programming have been accomplished to extend
registration to several additional types or classes of compounds. In
general, the CAS policy has been to accumulate nonregisterable compounds for
study, and when enough examples of any one type are available, to establish
structuring conventions and do the necessary computer programming to permit
registration of these additional classes. Under Contract C414, the following
types of compounds were added to those that could be machine registered:

1. Machine Registration of Free Radicals

In October 1965, CAS decided to register free radicals (that is,
compounds in which one or more atoms bear an unpaired electron) by specifying
in the connection table the abnormal valence of the atom(s) bearing the
unpaired electron(s). Specifying the valence results in an override of the
computer editing technique for the valence of atoms, and permits the
compound to be registered with the specified abnormal valence.

2. Compounds Labeled Unspecifically with Isotopes

In several areas of chemistry, it is common to "label" compounds
with isotopes in order to permit tracing the compound in an experiment.
The Registry System from the beginning has been able to handle isotopically
labeled compounds when the structural positions of the labeling isotopes
are known. In September 1965, an extension was made in the system to permit
registration of labeled compounds when the structural positions of the
isotopes are unknown or when the number of atoms of the labeling isotope
is unspecified.

This extension involves the use of textual descriptors for unspecifically
labeled compounds; each descriptor consists of the elemental symbol and mass
number of the isotope, together with an arabic numeral or an "X" indicating
the number of atoms of the isotope. For example, a compound labeled with
two atoms of carbon-14 would be registered with a textual descriptor 2C-14.
If the number of labeling atoms were unknown, the descriptor would be XC-14.
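
The descriptor format can be reproduced with a one-line formatter. The
function name and argument order are hypothetical; the output format follows
the two examples given in the text.

```python
from typing import Optional


def isotope_descriptor(symbol: str, mass_number: int,
                       count: Optional[int] = None) -> str:
    """Build a text descriptor such as 2C-14, or XC-14 when the count is unknown."""
    prefix = "X" if count is None else str(count)
    return f"{prefix}{symbol}-{mass_number}"
```

For example, `isotope_descriptor("C", 14, 2)` gives the descriptor for two
atoms of carbon-14, and omitting the count gives the "X" form.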

3. Oligomers

During the second quarter of 1966, CAS developed the structuring
conventions to permit the registration of oligomers. These compounds are
polymers in which a fully defined unit is repeated a known number of times,
but for which the interconnection between units is usually unknown. An
example is the propylene pentamer. Oligomers are registered on the
basis of their structural units, with a text descriptor (dimer, trimer, etc.)
indicating the number of repetitions of the unit.

4. Inorganic Compounds

Inorganic compounds are defined as those that do not contain carbon.
As a result of study of this class of compound during the early registry
operations, CAS staff determined that inorganic compounds could best be
registered by adopting the bonding conventions used for organic compounds,
thus retaining full structural specificity. The accumulated file of
inorganic compounds (i.e., the inorganic compounds from CA Volume 62 on and
from the CAS reference files) was registered between June 1967 and February
1968. Since then, inorganic compounds have been routinely registered.

5. Coordination Compounds

For the Registry, coordination compounds are defined as compounds
in which the number of attachments to a central atom (usually a metal atom,
but not necessarily) exceeds the generally accepted valence of the central
atom. In general, structures are represented by direct connection (bonds)
between the central atom and the attached groups (ligands). The oxidation
state of the central atom is represented by a charge which may be positive,
negative, or zero. A ligand may also be neutral, positive, or negative;
its charge is indicated on the atom directly attached to the central atom.
The coordination number, represented by the number of attachments to the
central atom, and its oxidation state are significant data for coordination
compounds of that element, and thus must be included in the recorded data.

In preparation for the Eighth Collective Indexing Period (1967-1971)
CAS staff defined new structuring and naming conventions for coordination
compounds, permitting their consistent treatment in indexing and registration.
Coordination compounds having six or fewer attachments to the central atom
were input to the Registry System as it operated for the 7010 computers.
The reprogramming of the Registry Structure System for the IBM 360 allowed
input of structures having as many as 19 attachments to any one atom. In
addition, the number of charges and the number of attachments to a given
central atom are automatically edited in the 360 system.

6. Chemical Elements

The chemical elements themselves were not initially processed because
the Registry System was programmed not to accept any "graph" that
involved only a single nonhydrogen atom. This provision, designed as an input
editing check, proved in practice to be of little value. Therefore, a
program adjustment has been made to permit the registration of the elements,
their isotopes, and allotropes. The elements and their isotopes and allotropes
from CA Volumes 62-66 were registered between June 1967 and February 1968; the
less common isotopes and allotropes continue to be processed for current
volumes of CA.

7. Boranes and Carboranes of Known Structure

Boranes and carboranes, which contain molecular skeletons of
interlinked boron atoms or interlinked boron and carbon atoms, respectively,
exhibit unusual aspects of structural representation. Based on a study of
such compounds, and aided by nomenclature conventions defined by the ACS
Council Committee on Nomenclature, CAS staff developed appropriate
structuring conventions and input procedures to permit these compounds to be
routinely registered. These procedures were available for use in February
1968.

8. Partially Determined Structures

Partially determined structures are substances about which some
structural information is known, but for which full structural definition
is not available. Two types of partially determined structures
("dot-disconnected" fragments with unknown ratios, and oligomers) are now
registered, as described above. In addition, CAS has defined the
structuring conventions needed to register several additional classes. These
include compounds in which two or more fully defined fragments are
connected in an unspecified way. Some limitation may be given about the
connection; for instance, a substituent may be known to be attached to a
ring in a ring-containing compound but not to any of the chain portions of
the compound. Another type of partially determined structure includes one
fully defined structural portion and one or more attached fragments
represented only as summation molecular formulas.

The procedures developed for handling these structures allow the
computer record to reflect as much specificity as is present in the original
document. Implementation of these procedures awaits completion of
computer programming.

9. Polymers

Polymers present problems in registration because these substances
are made up of recurring structural units, each of which may be regarded
as derived from a specific compound, or monomer. The number of units is
usually large and variable, a given polymer sample being characteristically
a mixture of structures with different molecular weights. Since the
beginning of Contract C414, CAS has been studying the ways polymers are
reported in the literature in order to define efficient and meaningful
registration methods. These studies show (see Appendix G) that polymers
can be registered:

a. On the basis of their structural repeating units (SRU's) with
monomer information if available;

b. In terms of their monomers, if no SRU information is given;

c. By nonstructural data (name, application(s), generic types),
if no structural information is given.

Procedures for registering polymers have been developed; implementation of
these procedures awaits the completion of computer programming.
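
The three-way choice listed above amounts to a simple order of preference.
The sketch below is one hypothetical way to express it; the argument names
and the returned dictionary are illustrative and do not describe any CAS
program, since the report states that the programming was not yet complete.

```python
from typing import Dict, Mapping, Optional, Sequence


def polymer_registration_basis(
    sru: Optional[str] = None,
    monomers: Sequence[str] = (),
    nonstructural: Optional[Mapping[str, str]] = None,
) -> Dict[str, object]:
    """Pick the registration basis in the order a, b, c given in the text."""
    if sru:
        # (a) SRU, carrying monomer information along when it is available.
        return {"basis": "SRU", "sru": sru, "monomers": list(monomers)}
    if monomers:
        # (b) Monomers only, when no SRU information is reported.
        return {"basis": "monomers", "monomers": list(monomers)}
    if nonstructural:
        # (c) Name, application(s), or generic type when no structure is given.
        return {"basis": "nonstructural", "data": dict(nonstructural)}
    raise ValueError("nothing reported that could support registration")
```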

E. Improved Text Descriptor Processing

Those aspects of a chemical structure not incorporated into the
two-dimensional structural diagram (for example, stereochemistry) are
handled in the Registry System by means of text descriptors. Each such
descriptor is a string of letters, numbers, and/or punctuation that is
an integral part of the structural record. Two compounds that possess the
same two-dimensional structure but differ in text descriptor will be
assigned different Registry Numbers. To assure accuracy and consistency in
this process, all such sets of compounds with like two-dimensional
structures but different descriptors are reviewed in context by a chemist
before final acceptance by the computer.

1. Standardized Descriptors for Alkaloids, Carbohydrates, Steroids,
and Terpenes

As the Registry System has matured, CAS has developed standardized
and somewhat codified stereochemical descriptor conventions for four groups
of compounds: alkaloids, carbohydrates, steroids, and terpenes. In general
such descriptors are made up of abbreviations of the CA Preferred Names for
the parent compounds, plus alphanumeric prefixes indicating stereochemistry
of specified nodes of the structure. Each name implies the stereochemistry
at certain nodes in the structure; thus, the prefixes need specify
stereochemistry only for the nodes not implicit in the name. As an example,
consider the alkaloid compound with the CA index name Crinan-3α-ol,
1β,2β-epoxy-7-methoxy-. "Crinan" is the parent compound and nodes 1, 2, and 3
have stereochemistry that must be specified. The descriptor for this
compound is 1B,2B,3A-CRINAN.
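
The way such a descriptor is assembled can be sketched as below. The mapping
of alpha to "A" and beta to "B" is inferred from the single example in the
text, and the function and argument names are hypothetical.

```python
from typing import Dict

# Inferred from the example: alpha is coded A, beta is coded B (an assumption).
GREEK_CODE = {"alpha": "A", "beta": "B"}


def stereo_descriptor(parent_abbreviation: str, node_configs: Dict[int, str]) -> str:
    """Join node prefixes in node order, then append the parent abbreviation."""
    prefixes = [
        f"{node}{GREEK_CODE[config]}"
        for node, config in sorted(node_configs.items())
    ]
    return ",".join(prefixes) + "-" + parent_abbreviation.upper()
```

Only nodes whose stereochemistry is not already implied by the parent name
would be passed in, which keeps the descriptor short.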

2. Automatic Editing of Text Descriptors

In order to reduce the amount of professional attention required
for the resolution of text descriptors, a table of acceptable descriptors
has been established for computer checking (see Appendix F). Descriptors
that exactly match one on the list will be accepted by the computer without
review by a chemist. Other descriptors can be entered in the file only
when a chemist "flags" the descriptor to indicate that it is acceptable.
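
The acceptance rule reduces to a table lookup with a manual override. In the
sketch below the table contents are placeholders drawn from descriptors
mentioned elsewhere in this report, not the actual list in Appendix F, and
the status strings are hypothetical.

```python
# Placeholder entries; the real table is the one referred to as Appendix F.
ACCEPTABLE_DESCRIPTORS = {"1B,2B,3A-CRINAN", "DIMER", "TRIMER"}


def descriptor_status(descriptor: str, chemist_flag: bool = False) -> str:
    """Accept exact table matches; otherwise require a chemist's flag."""
    if descriptor in ACCEPTABLE_DESCRIPTORS:
        return "accepted"
    if chemist_flag:
        return "accepted by chemist"
    return "held for review"
```

The match is deliberately exact: a descriptor that differs from a table
entry in any character is held until a chemist flags it.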

VI. DESKTOP ANALYSIS TOOLS

Contract C414 calls for the development of Desktop Analysis Tools
(DAT's) which, by functioning as specialized indexes to material in
the computer files of the Chemical Registry System, eliminate some
of the costly reprocessing of data that would be required if structures
were completely reprocessed and re-registered each time they were
encountered. Under Task I, CAS has developed and continually improved Desktop
Analysis Tools that are computer-produced compilations of chemical names
that have been entered into the system. These tools are used by clerical
and chemical staffs to link the name of a compound with its Registry Number,
and are thus useful in manual name-matching.

The first Desktop Analysis Tool produced was a single volume containing
names ordered in strict alphanumerical sequence beginning with the
first character in a name. While this ordering was straightforward and
consistent, it resulted in the separation of some names that chemists were
accustomed to seeing together. For example, trans-stilbene was alphabetized
with the t's, whereas cis-stilbene was alphabetized with the c's. Moreover,
as the Registry Files grew larger, the DAT's did also, making them
more cumbersome to use and requiring some method for updating. Therefore,
during the first year of the contract, CAS developed methods for publishing
the DAT's in different volumes based on the type of name, for creating periodic
supplements to the DAT's (each supplement containing only material added to
the files since the last full DAT was issued), and for arranging names more
nearly in the order expected by chemists.
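
The difference between the two orderings can be seen with a short sketch.
The prefix list is limited to the two prefixes named in the text, and the
key functions are hypothetical illustrations, not the CAS sorting programs.

```python
from typing import Tuple

PREFIXES = ("cis-", "trans-")  # only the two prefixes named in the text


def strict_key(name: str) -> str:
    """Strict sequence beginning with the first character of the name."""
    return name.lower()


def chemist_key(name: str) -> Tuple[str, str]:
    """Order on the name without its prefix, then on the prefix itself."""
    lowered = name.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            return (lowered[len(prefix):], prefix)
    return (lowered, "")


names = ["trans-Stilbene", "Styrene", "cis-Stilbene", "Toluene"]
strict_order = sorted(names, key=strict_key)
chemist_order = sorted(names, key=chemist_key)
```

Under the strict key the two stilbenes fall at opposite ends of the list;
under the second key they are adjacent.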

The requirements of Task III of Contract C414, together with general work
on improved input and correction procedures for names, resulted in a
greatly improved capability for producing DAT's. The Task III DAT's are
described in the "Final Report to the National Science Foundation on
Contract NSF-C414, Task III," and in the special CAS report to the NSF:
"Policies and Procedures Governing the Compiling of the Desktop Analysis
Tools for the Common Data Base."

Some of the generalized procedures for the improved handling of
chemical nomenclature, which affect not only the DAT's but also the entire
file of nomenclature, are described below. These computer-based checking
procedures support the keyboarding and input procedures for chemical names,
helping to assure consistency in the file and reducing the need for clerical
and chemical review.

A. Italicization

The editing program developed at CAS automatically italicizes any
single Roman letter surrounded by punctuation characters. It also
italicizes a single alphabetic character that begins or ends a name, excluding
small capital letters. The lower-case letters "d" and "t," used to cite
the hydrogen isotopes deuterium and tritium, are also recognized and
italicized, as is the capital letter "H" for hydrogen under certain
conditions. Also italicized are the alphabetic characters used within ring
system names when they occur within the ring fusion brackets, unless they
are immediately preceded by numerals.

In addition to the above, there is a list of 56 character strings that
are automatically italicized when surrounded by punctuation. This list
includes such stereochemical descriptors as "cis-," "trans-," and "erythro-."
The same list is used to designate terms which are not to be capitalized
when they occur at the beginning of a compound name and are followed by
punctuation. This list is open-ended, and additional descriptors can be
added as they are incorporated in chemical nomenclature.
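
Two of these rules (single letters bounded by punctuation, and listed
strings bounded by punctuation) can be sketched with a regular expression.
The sketch treats the start and end of a name as boundaries, marks italics
with underscores, and lists only the three strings quoted above; all of
these are assumptions. The exceptions for small capitals, isotopes,
hydrogen, and ring fusion brackets are not modeled.

```python
import re
import string

ITALIC_STRINGS = {"cis", "trans", "erythro"}  # three of the listed strings
PUNCTUATION = set(string.punctuation)


def _bounded_by_punctuation(text: str, start: int, end: int) -> bool:
    """True when the span touches punctuation (or a name boundary) on both sides."""
    before_ok = start == 0 or text[start - 1] in PUNCTUATION
    after_ok = end == len(text) or text[end] in PUNCTUATION
    return before_ok and after_ok


def mark_italics(name: str) -> str:
    """Wrap italicized tokens in underscores, leaving other text unchanged."""
    def replace(match: "re.Match[str]") -> str:
        token = match.group()
        candidate = len(token) == 1 or token.lower() in ITALIC_STRINGS
        if candidate and _bounded_by_punctuation(name, match.start(), match.end()):
            return f"_{token}_"
        return token

    return re.sub(r"[A-Za-z]+", replace, name)
```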



B. Capitalization

Analyses of computer nomenclature files indicate that there is an
average of 1.6 capital letters per name. The majority of these are the
first letter of the basic name. The program automatically supplies this
capital if it occurs in a sequence of two or more Roman letters. It will
not capitalize the letter if it is within any of the character strings to
be italicized.

Other capitalization is also automatically supplied by the program.
The single alphabetic character which occurs most frequently as a capital
within a name is the letter "H," usually to indicate hydrogen atoms in
ring systems. The computer program identifies this letter in context as a
capital "H" followed by any punctuation character and preceded by (1) the
punctuation character "prime," (2) a numeral, or (3) one of the
letters a, b, c, or d that, in turn, is preceded by a prime or a numeric
character. The Roman numerals 1 through 10 are also identified and
capitalized, as are the stereochemical descriptors "R" and "S" cited within
parentheses.
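
The context rule for the letter "H" maps naturally onto a regular expression
with lookbehind and lookahead assertions. The sketch assumes that names
arrive in lower case and that "punctuation" means any character that is
neither alphanumeric nor a blank; both points are assumptions.

```python
import re

# "h" preceded by a prime or numeral, or by a/b/c/d that itself follows a
# prime or numeral, and followed by a punctuation character.
_INDICATED_HYDROGEN = re.compile(
    r"(?:(?<=[0-9'])|(?<=[0-9'][abcd]))h(?=[^\w\s])"
)


def capitalize_hydrogen(name: str) -> str:
    """Capitalize the letter h where the context marks it as hydrogen."""
    return _INDICATED_HYDROGEN.sub("H", name)
```

Each lookbehind alternative has a fixed width, which is what Python's `re`
module requires; an ordinary "h" inside a word such as "hexane" is left
alone because it is not preceded by a numeral or prime.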

C. Checking of Punctuation Consistency

To prevent clerical errors from being recorded in the files, the
computer is programmed to check for and correct (1) punctuation such as
periods and commas incorrectly typed at the end of names, (2) two
blank spaces where only one is required, and (3) hyphens omitted between
parentheses and brackets and locants which immediately precede or follow
such punctuation marks.
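
The first two corrections are mechanical and can be sketched directly. The
third (restoring omitted hyphens next to parentheses and brackets) depends
on nomenclature context that the report does not spell out, so it is left
out of this hypothetical routine.

```python
import re


def tidy_punctuation(name: str) -> str:
    """Drop stray trailing periods and commas and collapse repeated blanks."""
    name = name.strip().rstrip(".,").rstrip()
    return re.sub(r" {2,}", " ", name)
```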



D. Elimination of Invalid Characters 



Certain symbols such as the equality sign and quotation marks are used
in Registry System operations as keyboarding conventions and do not imply
their normal meanings. In the past, they have caused some confusion at
printout. The computer program now checks for these characters and
removes them when necessary.



E. Printing of Diagnostic Comments 

Errors such as spelling, letter transposition, and significant
punctuation cannot yet be automatically corrected. However, to reduce the
number of names that must be manually reviewed, programs are in effect that
detect the potential problem and display the questionable name along with a
diagnostic comment (see Figure 4). This technique directs the professional's
attention to the potential problem immediately, thereby reducing the time
required to make editorial decisions. In addition, the computer files can
also be updated without rekeyboarding the compound names.
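
One plausible way to detect two of the conditions reported in Figure 4 is
to compare names after stripping everything but letters, and then to
compare their locants. The message texts are those shown in the figure; the
comparison logic is an inference from the figure's examples, not a
description of the CAS programs.

```python
import re
from typing import List, Optional


def letters_only(name: str) -> str:
    """The name reduced to its letters, ignoring case."""
    return re.sub(r"[^a-z]", "", name.lower())


def locants(name: str) -> List[str]:
    """The numeric locants of the name, in order of appearance."""
    return re.findall(r"\d+", name)


def diagnose(first: str, second: str) -> Optional[str]:
    """Return a diagnostic comment for two names filed under one substance."""
    if first == second or letters_only(first) != letters_only(second):
        return None
    if locants(first) != locants(second):
        return "TWO IDENTICAL NAMES EXCEPT FOR DIFFERENT LOCANTS"
    return "TWO IDENTICAL NAMES EXCEPT FOR PUNCTUATION"
```

The routine only raises the question; deciding which of the two names is
correct remains an editorial decision.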

BIBLIOGRAPHY DIAGNOSTICS

REGISTRY NO.   MESSAGE

2,635,134      MULTIPLE PREFERRED CA INDEX NAMES
                 Benzene, 1-bromo-2,3-(methylenedioxy)-
                 Benzene, 4-bromo-1,2-(methylenedioxy)-
                 1,3-Benzodioxole, 4-bromo-

5,169,788      TWO IDENTICAL NAMES EXCEPT FOR PUNCTUATION
                 C15H17NS2
                 3-(Di-2-thienylmethylene)-1-methylpiperidine       027,
                 3-(Dithien-2-ylmethylene)-1-methylpiperidine       029,

5,173,659      TWO IDENTICAL NAMES EXCEPT FOR DIFFERENT LOCANTS
                 C12H20
                 Naphthalene, 2,3,4,4a,5,6,7,8-octahydro-1,4a-dime...   027,
                 Naphthalene, 1,2,3,4,4a,5,6,7-octahydro-4a,8-dime...   027,

5,498,704      TWO IDENTICAL NAMES EXCEPT FOR DIFFERENT STEREO
                 C13H25NO
                 Azacyclotridecan-2-one, 1-methyl-                  027,
                 Azacyclotridecan-2-one, 1-methyl-, trans-          028,

Figure 4: Examples of Computer-Produced Diagnostic Comments

F. Nomenclature Sort Key Program

In order to provide a listing of names ordered the way chemists expect
to see them, CAS has developed a nomenclature sort key technique by which
to arrange names.

The method by which the Nomenclature Sort Key Program operates is as
follows: Each name is divided (by internal computer routine) into four
major portions: the parent, the substituents, the modifications, and the
stereochemistry. Not every name will have all four portions, but, except
for laboratory names, all will have at least an alphabetic parent. They
may or may not possess the other portions described below. In a chemical
name containing all possible portions, each field will contain an
alphabetic string of characters and locants comprised of numerals or letters
or both. The computer program is written such that each major field is
investigated individually and in the order cited above. The Sort Key is
generated so that, within each major field, the characters are rearranged
with the letters first, followed by the locant. When the rearrangement is
complete, the parts of the name are ordered in the following manner:

1. Alphabetic portion of parent
2. Locant portion of parent
3. Alphabetic portion of substituents
4. Locant portion of substituents
5. Alphabetic portion of modifications
6. Locant portion of modifications
7. Alphabetic portion of stereo
8. Locant portion of stereo

As an example of this rearrangement, the compound 1,2-Cyclohexanedicarboxylic
acid, 3-hydroxy-, 1-ethyl ester, (+), would be rearranged as:
cyclohexanedicarboxylicacid 1 2 hydroxy 3 ethylester 1 (+). This rearranged
name is the Sort Key. When a printout is called for, a program automatically
alphabetizes the names by ordering on the Sort Keys. However, the printout
contains not the Sort Key, but the properly edited name.
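
The rearrangement can be approximated in a few lines. This sketch assumes
that the four portions of a name are separated by a comma followed by a
space and that locants are purely numeric; the actual program identifies the
portions by internal routines that the report does not describe.

```python
import re
from typing import List


def sort_key(name: str) -> str:
    """Rearrange each portion of a name as letters first, then locants."""
    parts: List[str] = []
    for portion in name.split(", "):
        alphabetic = "".join(ch for ch in portion.lower() if ch.isalpha())
        numeric_locants = re.findall(r"\d+", portion)
        if not alphabetic and not numeric_locants:
            parts.append(portion.strip())  # e.g. a bare "(+)" stereo portion
        else:
            parts.extend(p for p in [alphabetic, *numeric_locants] if p)
    return " ".join(parts)


example = "1,2-Cyclohexanedicarboxylic acid, 3-hydroxy-, 1-ethyl ester, (+)"
key = sort_key(example)
```

For the name used in the text the sketch yields
`cyclohexanedicarboxylicacid 1 2 hydroxy 3 ethylester 1 (+)`. A listing is
then produced with `sorted(names, key=sort_key)`, printing the edited names
rather than the keys.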

The Nomenclature Sort Key Program accommodates virtually all types of
chemical nomenclature, with one major exclusion: laboratory names, which are
usually composed solely of numerics and alphabetics. A second program is
designed to sort this type of name. For example, the laboratory name B9963DEX
would be sorted first by its numeric portion, the 9963, then on its
alphabetics. Printed, the laboratory names precede the names ordered on
alphabetics.
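
The ordering of laboratory names can be sketched with a two-part key. The
sketch assumes that the numeric portion is compared as a number and that
the alphabetic characters are compared in the order they appear in the name;
the report states only that the numeric portion comes first.

```python
import re
from typing import Tuple


def laboratory_sort_key(name: str) -> Tuple[int, str]:
    """Order first on the digits of the name, then on its letters."""
    digits = "".join(re.findall(r"\d", name))
    alphabetic = "".join(re.findall(r"[A-Za-z]", name)).upper()
    return (int(digits) if digits else 0, alphabetic)
```

For B9963DEX the key is `(9963, "BDEX")`, so the name sorts among other
laboratory names on 9963 before its letters are considered.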

Examples of typical nomenclature index ordering compiled and printed 
by computer using the Nomenclature Sort Key Program are shown in Figures 
5 through 7. 


[Figure 5 reproduces a computer-printed page of the nomenclature index, with
columns for the Registry Number, item and name-type codes, and the name,
molecular formula, and sources. The legible entries include the parent
Carbazole, whose derivatives appear in the order 3-amino-; 3-amino-9-ethyl-;
2,7-diamino-; 3,6-dinitro-; 9-ethyl-; 9-ethyl-3-nitro-; 9-methyl-;
9-(phenoxyacetyl)-; 1,2,3,4-tetranitro-; 9-vinyl-.]

Figure 5: Illustration of Ordering on Compound Parent

[Figure 6 reproduces a computer-printed page of the nomenclature index, with
columns for the Registry Number, item and name-type codes, and the name,
molecular formula, and sources. Names that differ only in a leading prefix
are listed together; for example, Butylbenzene, n-Butylbenzene,
sec-Butylbenzene, (-)-sec-Butylbenzene, (+)-sec-Butylbenzene, and
tert-Butylbenzene appear consecutively, as do Butyl benzoate and iso-Butyl
benzoate.]

Figure 6: Illustration of Ordering Ignoring Prefixes

Figure 6; Illustration of Ordering Ignoring Prefixes 
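
The two ordering rules named in the captions of Figures 6 and 7 (ignoring prefixes, and ordering by numeric value when the alphabetic parts are identical) can be pictured as a sort key. The following is only a minimal sketch and not the CAS sort program: the prefix list, the function name `sort_key`, and the rule for splitting a name into alphabetic and numeric runs are all assumptions made for illustration.

```python
import re

# Hypothetical list of prefixes ignored for filing; the actual CAS list is not given here.
IGNORED_PREFIXES = ("n-", "sec-", "tert-", "iso-", "o-", "m-", "p-")


def sort_key(name):
    """Return a key that ignores leading prefixes and compares digit runs by value."""
    text = name.lower()
    # Strip ignored prefixes from the front of the name until none remain.
    stripped = True
    while stripped:
        stripped = False
        for prefix in IGNORED_PREFIXES:
            if text.startswith(prefix):
                text = text[len(prefix):]
                stripped = True
    # Split into alternating non-digit and digit runs; digit runs compare as integers.
    parts = re.findall(r"\d+|\D+", text)
    return [(1, int(p), "") if p.isdigit() else (0, 0, p) for p in parts]


reds = ["C.I. Direct Red 127A", "C.I. Direct Red 17",
        "C.I. Direct Red 121", "C.I. Direct Red 10"]
butyls = ["tert-Butylbenzene", "n-Butylamine", "Butylaniline", "sec-Butylamine"]

print(sorted(reds, key=sort_key))    # 10, 17, 121, 127A
print(sorted(butyls, key=sort_key))  # the two Butylamines, Butylaniline, tert-Butylbenzene
```

A plain character-by-character sort would file "Red 121" ahead of "Red 17"; comparing the digit runs as integers avoids that. Names whose keys are equal after prefix removal keep their input order here, which is a property of Python's stable sort rather than anything stated in the report.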













[Figure 7 listing: C.I. Direct Red entries in the order 10, 17, 49, 55, 61, 69, ..., 121, 123, 127, 127A, 130, 147, 149, 152, 153, ..., followed by C.I. Direct Violet entries in the order 1, 3, 5, 7, ..., 9, 11, 12, 16, 21, 26, ..., 31, 32, 35, ..., 39, 40, 41. Molecular formulas and Registry Numbers are not legible in this copy.]

Figure 7: Illustration of Ordering by Numeric
Value When Alphabetics Are Identical



VII. REDESIGN OF THE REGISTRY COMPUTER SYSTEMS AND
REPROGRAMMING FOR THE IBM 360 COMPUTERS



One of the most fundamental projects in Task I was started at the 
beginning of 1967 in order to completely redesign the Registry System for
more efficient operation and to reprogram it for the IBM 360 computers. At 
the time, the Registry was operating on the IBM 7010 computers, and the 
basic programs had become inefficient due to the high level of modifications 
added to them as a result of initial operations. The 360 computer equipment 
represented a "third generation" of hardware capabilities and offered signi- 
ficantly improved processing times, processing capabilities, and capacity 
for future growth without major reprogramming. 

As a result of the conversion to 360 equipment, CAS was able to de-
velop and implement improved installation-wide standards for computer pro-
grams, to incorporate in the redesigned programs the results of experience
with the first large-scale Registry operations, and to significantly improve 
the capacity for interface with other information processors and with future 
users of the System. 

Among the overall improvements made in all Registry processing as a re-
sult of the conversion to 360 hardware are standard file formats, modular
programming, and improved hardware capabilities. These are briefly sum-
marized below, and are more fully described in Appendix F.

Standard file format refers to a CAS installation-wide standard for
controlling the definition and use of each element of data used in any
system. This format simplifies programming, assures consistency, and makes
input, output, and search of the files easier.

Modular programming refers to the design of a computer system as a 
series of "pieces" (modules), each designed for a specific function. Modu- 
lar programming increases the flexibility of a system by permitting modules 
to be changed individually without affecting other system operations. 

Hardware improvements offered by the 360 include faster core access 
times, the ability to process data one-half byte at a time (offering a
chance for compaction of the file), and the future potential for multi- 
programming, direct access, and other advanced processing techniques. 

A. Reprogramming the Structure Registry

The 360 Registry System was installed in April 1968 and has operated
since that time. The technical improvements made in structure processing 
in this new System are described in Appendix F and are briefly indicated 
below. 

1. Improved Handling of Tautomers 

The new System automatically recognizes the equivalency of those 
unique compounds (called tautomers) for which two or more different but 
equally valid structural diagrams can be drawn. These diagrams do not 
represent isolable chemical species, and thus a single Registry Number is 
assigned to the tautomer no matter which alternative structure is entered. 
At the time the Registry Structure File was converted from 7010 format to
360 format, any tautomers on file were changed to the new representation.

2. Improved Handling of Rings

The redesigned registry programs incorporate new procedures for 
tracing paths in ring-containing compounds. The programs identify pairs of
"ring closure" atoms as the starting point for the tracing of rings. This 
process is more efficient than the previous system of tracing rings from 
an arbitrarily selected atom. 
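
One way to picture ring tracing that starts from ring closure atoms is sketched here. It is an illustration under stated assumptions, not the Registry program: it builds a spanning tree over the atoms, treats every bond left out of the tree as a ring closure pair, and traces each ring along the tree paths that join the two closure atoms. The function names and the input layout (atoms numbered from 1, bonds as pairs) are hypothetical.

```python
def ring_closure_pairs(n_atoms, bonds):
    """Split the bonds of a connected structure into a spanning tree and closure pairs.

    Atoms are numbered 1..n_atoms; bonds is a list of (a, b) pairs.
    Returns (parent, closures), where parent maps each atom to its tree parent.
    """
    neighbors = {a: [] for a in range(1, n_atoms + 1)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    parent = {1: None}      # tree parent of every atom reached so far
    stack = [1]
    tree_edges = set()
    while stack:
        atom = stack.pop()
        for nxt in neighbors[atom]:
            if nxt not in parent:
                parent[nxt] = atom
                tree_edges.add(frozenset((atom, nxt)))
                stack.append(nxt)

    # Any bond not used by the tree closes a ring.
    closures = [(a, b) for a, b in bonds if frozenset((a, b)) not in tree_edges]
    return parent, closures


def trace_ring(parent, closure):
    """Trace the ring closed by one closure pair by walking both atoms up the tree."""
    a, b = closure
    path_a = []
    while a is not None:
        path_a.append(a)
        a = parent[a]
    path_b = []
    while b not in path_a:
        path_b.append(b)
        b = parent[b]
    # b is now the first tree ancestor shared by the two closure atoms.
    return path_a[: path_a.index(b) + 1] + path_b[::-1]


# A six-membered ring (atoms 1-6) with one substituent (atom 7) on atom 1.
bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7)]
parent, closures = ring_closure_pairs(7, bonds)
rings = [trace_ring(parent, pair) for pair in closures]
```

For a connected structure the number of closure pairs equals bonds minus atoms plus one, so the tracing work is proportional to the number of rings rather than to the number of possible starting atoms.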

3. Additional Editing Features 

The new programs provide automatic editing and verification for 
certain established text descriptors, for Stock numbers (representing the 
oxidation states for selected multivalent metals), for "abnormal" atomic
mass citations, and for the coordination numbers of coordination compounds. 

4. Registration of Large Molecules

As a result of increased capacity in the 360 computers, certain 
arbitrary restrictions on the size and complexity of registered compounds 
were removed. The new system raises from 150 to 253 the number of non- 
hydrogen atoms accepted, and from six to 15 the number of nonhydrogen attach- 
ments allowed for any one atom. (See Figure 8.) 
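
The two limits quoted above lend themselves to a simple input check. The sketch below only illustrates the stated limits; the function name, the input layout, and the wording of the messages are assumptions, and the report does not describe how the Registry programs perform this test.

```python
MAX_NONHYDROGEN_ATOMS = 253   # limit stated for the 360 system (150 in the 7010 system)
MAX_ATTACHMENTS = 15          # nonhydrogen attachments per atom (six in the 7010 system)


def check_size_limits(n_atoms, bonds):
    """Return a list of limit violations for a structure given as bonded atom pairs."""
    problems = []
    if n_atoms > MAX_NONHYDROGEN_ATOMS:
        problems.append(
            f"{n_atoms} nonhydrogen atoms exceeds {MAX_NONHYDROGEN_ATOMS}")
    # Count the nonhydrogen attachments of every atom that appears in a bond.
    attachments = {}
    for a, b in bonds:
        attachments[a] = attachments.get(a, 0) + 1
        attachments[b] = attachments.get(b, 0) + 1
    for atom, count in sorted(attachments.items()):
        if count > MAX_ATTACHMENTS:
            problems.append(
                f"atom {atom} has {count} attachments, exceeds {MAX_ATTACHMENTS}")
    return problems
```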

5. Structure Match without Registration 




The new system permits a structure to be searched for in the 
Registry Files without adding it to the files if it is new. This feature 
will increase the file’s usefulness to users with confidential or proprietary 
compounds who merely want to determine whether the compounds have yet been 
reported in the literature. 
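
The difference between matching and registering can be shown with a toy file. This is a sketch only: it assumes each structure has already been reduced to a unique string, and the class and method names are hypothetical rather than taken from the Registry programs.

```python
class Registry:
    """Toy registry keyed by a unique structure string (an assumption of this sketch)."""

    def __init__(self):
        self._numbers = {}   # unique structure string -> Registry Number
        self._next = 1

    def match(self, structure):
        """Look the structure up without changing the file; None if it is not on file."""
        return self._numbers.get(structure)

    def register(self, structure):
        """Return the number already on file, or assign a new one if the structure is new."""
        number = self.match(structure)
        if number is None:
            number = self._next
            self._next += 1
            self._numbers[structure] = number
        return number
```

A user with a proprietary compound would call only `match`, so that a miss leaves the file unchanged.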

B. Reprogramming of Nonstructural Systems 

The Registry's Bibliography and Nomenclature Systems were, in their 
7010 versions, merely file maintenance systems with minimum output capability
and with little capability for input editing and/or verification of names
and references. In redesigning these systems for the IBM 360 computers,
therefore, CAS had the objectives of greatly improved input editing capa-
bility and improved ability for flexible retrieval and file updating.

Figure 8: Typed structure input for the B chain of Bovine Insulin (C157H232N40O47S4),
a structure containing 248 nonhydrogen atoms. This example illustrates
a large structure for which machine registration is possible in the 360 system,
but was impossible in the 7010 system.

The redesigned systems included computer checking routines to support 
the keyboarding of names. These routines provide capitalization and 
italicization automatically, check, and in some cases correct, punctuation
errors, and intercompare names to detect "problem" situations such as two
names for a compound that differ only in locants, or two different CA
Preferred Index Names for a compound. These improvements have been described 
in Section VI above. 

The redesigned systems also incorporated output functions to maintain 
subfiles of the Registry System and create tape and printed listings of in- 
formation according to various selection criteria imposed at output time. 

Due to the complexity of the redesigned systems, they were first 
tested on a large subfile of the Registry System, the Common Data Base of 
Task III of Contract C414.* This large-scale test identified several prob-
lems, primarily in those aspects of the system that dealt with the retrieval
and output functions. Therefore, CAS undertook to redesign these so-called
"interface" functions of the Registry before making the new features generally
available. The CAS Interface System is scheduled for implementation in the
Autumn of 1969.

* See Final Report to the NSF on Contract NSF-C414, Task III, March 1969.



VIII. GLOSSARY



ACS 



American Chemical Society

ALGORITHM

A precise step-by-step procedure which, when exactly followed, will
result in a successful conclusion. No intellectual judgment is required
in executing an algorithm.

ALTERNATING BONDS 

A succession of alternating single and double bonds within a closed 
(cyclic) system. 

BEILSTEIN 

The term commonly used by chemists to refer to Beilsteins Handbuch
der Organischen Chemie.

CA 



Chemical Abstracts, the main publication of the Chemical Abstracts
Service. 

CAS 



The Chemical Abstracts Service, a division of the American Chemical 
Society. 

CAS CHEMICAL REGISTRY SYSTEM 

An interrelated set of professional, clerical and computer-based pro- 
cesses that accomplish the registration of chemical compounds and the main- 
tenance of information files resulting from the registration process. These 
files include compound records, molecular formulas, nomenclature, and biblio- 
graphic data. Registration of a compound involves the determination of the 
existence or nonexistence of a structural graph representation in the Registry 
Structure Files. The process includes the assignment of a unique number 
(Registry Number) to each substance that is new to the files.

CHARACTER SET

A specified selection of symbols (characters). The conventional com-
puter printer character set consists of 48 characters: 26 upper case Latin
letters, 10 Arabic numbers, and 12 symbols. CAS has a modified computer print
chain which allows the use of a 120 character set. Each CA publication has spe-
cific character set requirements that are subsets of the CAS "universal
character set" allowing for the upper and lower case Latin and Roman alpha-
bet, Arabic numbers, superior and inferior characters, italic, lightface,
and boldface characters, and a wide range of other symbols.

CHECK CHARACTER (check letter, check digits) 

A technique used to verify the accuracy of recorded data. The check 
character is computed from the data content via an algorithm and is attached 
(usually suffixed) to the original data. Checking is accomplished by re- 
computing the check character and comparing it to the recorded character. 
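
The compute-and-compare cycle described above can be sketched as follows. This report does not state which algorithm it uses; the example borrows, as an assumption, the weighted-sum rule documented for present-day CAS Registry Numbers, in which each digit is multiplied by its position counted from the right and the sum is reduced modulo 10. The function names are hypothetical.

```python
def cas_check_digit(body_digits):
    """Compute a check digit by the present-day CAS Registry Number rule.

    body_digits: the number without its check digit, e.g. "7615" for 76-15-3.
    Each digit is weighted by its position counted from the right (1, 2, 3, ...);
    the check digit is the weighted sum modulo 10.
    """
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(body_digits), start=1))
    return total % 10


def verify(number):
    """Recompute the check digit and compare it with the recorded last digit."""
    digits = number.replace("-", "")
    return cas_check_digit(digits[:-1]) == int(digits[-1])


print(cas_check_digit("7615"))   # 5*1 + 1*2 + 6*3 + 7*4 = 53, so the digit is 3
print(verify("76-15-3"))         # True
print(verify("76-15-4"))         # False: the recorded digit does not match
```

Because the weights differ by position, this kind of check catches any single wrong digit and most transpositions of adjacent digits.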

CHEMICAL COMPOUND 

See COMPOUND. 

CHEMICAL REGISTRY 

See CAS CHEMICAL REGISTRY SYSTEM.

COLLECTIVE INDEX 

One of a series of indexes, which, since the 5th Collective (1947-
1956), have covered five years of CA. Previous Collectives covered 10
years. Each Collective Index combines the contents of the corresponding
volume indexes.

COLLECTIVE PERIOD 

The period during which material is accumulated for the next collective 
index. In 1967, CAS began the 8th Collective period, during which material
will accumulate for the Eighth Collective Index (1967-71).



COMMON DATA BASE 

This is made up of the published data associated with the substances 
specified in Task III for processing under this contract. The data elements 
in the Common Data Base are: (1) whatever structural description of the
substance is provided; (2) Inverted Molecular Formula, when available; (3)
Nomenclature; (4) Source Codes; (5) references for those synonyms taken
from the Registry files.

COMPOUND

A single substance made up of identical molecular species.

COMPUTER-BASED

A computer-based process can be wholly computer executed, a computer 
assisted manual operation, or a process in which the results of a manual 
effort are subsequently recorded for computer processing. 

CONNECTION TABLE 

An atom-by-atom inventory of a molecule which shows each atom, the
atoms connected to it, and the types of linkages (bonds). Mass number, 
coordination number, valences, and charges are shown wherever they are 
required for exact identification. Stereochemical data are included. 
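
A connection table of this kind can be pictured as one record per atom. The layout below is a minimal sketch for illustration; the field names, the `Atom` class, and the helper `build_connection_table` are assumptions and do not reproduce the Registry's machine record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Atom:
    number: int
    element: str
    bonds: dict = field(default_factory=dict)   # neighbor atom number -> bond value
    charge: int = 0
    mass: Optional[int] = None                  # recorded only when needed for identification


def build_connection_table(elements, bonds):
    """Build the table from element symbols and (atom_a, atom_b, bond_value) triples.

    Atoms are numbered from 1 in the order the element symbols are given.
    """
    table = {i: Atom(i, el) for i, el in enumerate(elements, start=1)}
    for a, b, value in bonds:
        # Record each bond on both atoms so every atom lists all of its neighbors.
        table[a].bonds[b] = value
        table[b].bonds[a] = value
    return table


# Acetonitrile skeleton: C-C single bond, C-N triple bond (hydrogens not listed).
table = build_connection_table(["C", "C", "N"], [(1, 2, 1), (2, 3, 3)])
```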

DATA AND INFORMATION 

These terms are used interchangeably. 

DEFINITIVE OLIGOMER

A compound whose structure is represented by that of a single unit of
known structure repeated a known number of times. The structure of the
definitive oligomer is known only to the extent that it is a multiple of 
the single unit. 

DESKTOP ANALYSIS TOOLS 

Printed lists of specially selected substances for use in the analysis 
of input information to help identify already registered substances. These 
tools include listings of names, laboratory numbers, and acronyms with the
associated Registry Numbers, molecular formulas and source codes. 

FLAGGING 

Flagging is a process of including control symbols in the stream of 
recorded information to indicate special conditions to the computer program. 

FORMULA INDEX 

A compilation of molecular formulas arranged in order. Specifically, 
one of the published indexes to CA.

GRAPH (of a structure) 

The basic pattern of nonhydrogen nodes and their connecting lines in a 
structural diagram, with no designation as to the specific elements or types 
of bonds present. 



HARD COPY 



A printed copy of machine output in a readable form for human beings, 
e.g., printed reports, listings, documents, summaries, etc.

HARDWARE 

Physical equipment. 

HIERARCHY 

A series of objects, concepts or indexing terms divided, or classified 
in ranks or orders, as in a family tree or a botanical classification. A
genus-species relationship is a particular type of hierarchical relationship. 

KEYBOARD 

verb: To record data on an information storage medium (e.g., papers,
magnetic tape, punched cards, etc.) using a finger-operated set of keys.

noun: The assemblage of finger-operated keys used on an information-
recording machine such as a typewriter, keypunch, adding machine, etc.

LIGAND 

The group, in coordination compounds, which either: (1) is the donor
atom; or (2) contains the donor atom(s).

LINE 



The site of a bond in a structural diagram. 

MACHINE-READABLE

Refers to any representation of data in a form directly acceptable to 
a machine, specifically, to a computer. Punched cards, punched paper tape, 
and magnetic tape, for example, may all contain information in machine- 
readable form. 

MAGNETIC TAPE 

A plastic tape impregnated or coated with magnetic material on which 
information may be recorded by computer or keyboard-driven devices. Mag- 
netic tape is one of several means to store information for subsequent re- 
processing by computer. Excluded from the usage of this term in this 
document is magnetic tape used for sound or video recording. 

MANUAL 

Refers to processes handled largely by human effort.

MANUAL REGISTRATION

A system of registration in which a chemist, working with a small set
of files, determines the uniqueness of a substance and, on the basis of
this determination, assigns the substance a Registry Number or retrieves the 
one previously assigned. Manual registration is used for a small minority 
of substances that do not meet the criteria for machine registration. 

MECHANICAL REGISTRATION 

A computer-based system of registration. 

MECHANIZED (mechanized)

For purposes of this document: synonymous with computer-based.

MODULAR 

In computer systems design, modular systems have the following
attributes:

The total entity system is broken into explicitly defined sub-units
(Modules).

Each sub-unit is functionally independent of the other sub-units.

Modules have standard points and methods of interfacing with other
modules.

Modification to a module does not affect other modules if only the
activity is changed and not the interface.

MOLECULAR FORMULA 

A listing of each kind of element and total number of atoms of each 
kind present in a molecule. 

MOLFORM 

Molecular Formula. 

MONOMER 

The unpolymerized form of a compound that can be polymerized. 

MULTIPROGRAMMING

A computer processing technique that provides for the operation of
more than one application program in a computer system at the same time.
Program execution alternates among the individual programs.

NAME MATCH

As a registration process, a technique for determining identical 
compounds by comparing names in order to locate compounds and related 
data which might be identical. 

NODE 



A word used interchangeably with "atom" in the context of structural
diagrams and connection tables. 

NOMENCLATURE 

All types of names used to identify chemical compounds including 
acronyms and laboratory numbers. Preferred nomenclature refers to the 
compound names preferred for use in current CA Subject and Formula Indexes. 
Systematic nomenclature refers to any names that are systematically derived
from the structure of the chemical substance. 

ORGANIC COMPOUNDS 

For the purposes of this report these are carbon-containing compounds 
exclusive of coordination complexes. 

PAPER TAPE 

A tape on which a pattern of holes or cuts is used to represent data.

POLYMER

A compound or mixture of compounds each of which consists essentially 
of repeating structural units. 

PROGRAM 

The complete sequence of machine instructions and routines necessary 
to solve a problem. 

PUNCHED CARD 

A card of lightweight cardboard on which information is represented 
by holes punched in specific positions, which may be processed by auto-
matic machinery, semi-automatically, or manually.

RECORDING 

An interrelated set of activities that cause information to be stored 
in mechanized files or on a printed page. Recording includes transcription 
or dictation, keyboarding, and entry to the computer files.









REGISTRATION 

The process of determining the existence or nonexistence of a sub- 
stance in the Registry Files. The process includes the assignment of a 
unique number (Registry Number) to each substance that is new to the files; 
this number is used in a large, multifaceted system to associate data
related to that substance. 

REGISTRY FORM 

A data and worksheet especially designed for Registry System use to 
contain all data needed for Registry processing and including data to be 
filed in the Registry. Included are the structure drawing, molecular formula, 
CA reference, author name for the compound, connection table, etc. In gen- 
eral, one compound will be represented by one Registry Form. 

REGISTRY NUMBER 

The unique number which is assigned to each substance when it first 
enters the Registry and which is recalled each time that substance is 
checked against the file. The Registry Number may be used to identify 
fully the substance, and in the future it can be used as the address in 
specialized subject files to identify data associated with the substance. 

A Registry Number may include alphabetic characters, and will include a 
computed check digit. 

RING 

A group of atoms and their bonds which form a closed loop. 

RING CLOSURE PAIRS 

A field of the structural record which lists pairs of atoms with a 
connecting bond which constitute the uniting of the ends of a string of 
atoms forming a ring. 

SOURCE CODE 

A set of codes attached to each name in the Registry Nomenclature File 
that identifies: 

1. Source(s) in which the name is used. These may be published 
sources, for example Journal of Biological Chemistry, Chemical
Abstracts, and Merck Index, or private sources, for example a
file belonging to a given organization. 

2. Organizations which have some association with the substance and/or 
its name.

STEREOCHEMICAL DESCRIPTOR

An abbreviated form of the traditional designation of stereochemistry.

STRUCTURE 

The mode of linkage of atoms in a chemical molecule. 

STRUCTURAL DIAGRAM 

A two-dimensional graphic representation of the atoms and bonds of 
a molecule. 

SUBSTRUCTURE 

A specified set of atoms interconnected in a specified way. This 
constellation normally represents less than a complete molecule. 

SUBSTRUCTURE SEARCH 

The search through a file of representations of structures of compounds 
for a specified set of atoms connected in a specified way. The set normally 
represents less than a complete molecule. 

SUBSTRUCTURE SEARCH SYSTEM 

A computer-based structure retrieval system designed for generic searching
of structural information (stored in the CAS Chemical Registry).

TAUTOMER END POINTS

A field of the structural record which describes the end points of a
tautomer string in the graph proper. It is understood that the two atoms
in each of these tautomer end point pairs are sharing one hydrogen atom.

TAUTOMERS 

Two or more structures which are considered valid representations of a 
given substance. 

TEXT DESCRIPTOR 

A symbol, word, or words appended to the computer structural record 
which is used to describe the stereochemistry and/or other information
connected with that structure which cannot be described adequately in the
fields of the graph proper or in the other modification fields.

TID NUMBER

The Temporary Identification number used to identify a compound during
the input steps prior to registration.

TRIVIAL NAME

A name or number which is not systematically derived by consideration
of the structure of the corresponding compound.

VALENCE 



The sum of the H-count, the numerical value of charges, and the 
value of bonds (1, 2, or 3) to an atom. (For the purposes of registration
and searching.) 
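
The definition above amounts to a three-term sum. In the sketch below, reading "the numerical value of charges" as the absolute value of the charge is an assumption, as is the function name; the examples use uncharged atoms so that they do not depend on that reading.

```python
def valence(h_count, charge, bond_values):
    """Valence as defined in this glossary.

    h_count: number of attached hydrogen atoms.
    charge: formal charge; its absolute value is used (an assumption of this sketch).
    bond_values: value (1, 2, or 3) of each bond to a nonhydrogen atom.
    """
    return h_count + abs(charge) + sum(bond_values)


print(valence(0, 0, [1, 1, 1, 1]))   # carbon with four single bonds -> 4
print(valence(2, 0, [1]))            # amine nitrogen: two H, one single bond -> 3
print(valence(0, 0, [2]))            # carbonyl oxygen: one double bond -> 2
```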

VERIFY 



The process of assuring that recorded material agrees with the edited 
manuscript. The process may or may not include literal verification as 
commonly used in conjunction with keypunching. Verification can also imply 
proofing material to assure the computer records are correct in terms of
released manuscripts.



APPENDIX A



An Overview Description of the CAS 
Chemical Registry System 



Extracted from Substructure Search "Background Information
and Question Coding Instructions"

Copyright 1968 by the American Chemical Society

Taken From
Substructure Search
Background Information and Question Coding Instructions
Published by the American Chemical Society

Chapter 2: BACKGROUND INFORMATION ON THE CAS CHEMICAL
COMPOUND REGISTRY SYSTEM



2.1 Introduction 

The Chemical Compound Registry System* being implemented at CAS is a 
man-machine system for the storage and retrieval of chemical nomenclature, 
bibliographic data, and structural information. The basis for registration 
is a machine-derived notation unique to each compound and representing that 
compound's molecular structure in detail equivalent to that provided by 
the conventional structural diagram used as a communication tool by chemists. 
This chapter describes the Registry System with particular emphasis on 
those aspects of the system that influence Substructure Searching. 






2.2 General Contents of the Registry Structure File 

The Structure File of the Registry System presently contains current in- 
formation on more than 1,000,000 substances, including those which have been 
indexed in Chemical Abstracts beginning with Volume 62 (1965), plus informa-
tion from several other sources. All compounds having fully defined structures
(with few exceptions, oligomers for example) are registered and are search-
able in accordance with appropriate input structure-drawing conventions. The
Registry includes organic compounds, inorganic compounds, and coordination
compounds (compounds in which the number of attachments to a central atom,
usually a metal atom, but not necessarily, exceeds the generally accepted
valence of the central atom). Text descriptors are employed for different-
iation of stereoisomers. Mixtures (with the exception of racemic mixtures)
are not presently registered by structure and are therefore not retrievable
by substructure search.

* D. P. Leiter, Jr., H. L. Morgan, R. E. Stobaugh, "Installation and Operation
of a Registry for Chemical Compounds," J. Chem. Doc. 5 (4), 238 (1965).

One group of partially defined structures, the definitive oligomers 
such as dimers, trimers, etc., are registered by structuring the monomer
according to standard Registry procedures and by using a text descriptor
to indicate the number of repeating units. Complete polymer registration
procedures are still under development and will be implemented at a later 
date. 

Elements, their isotopes, and allotropes are registered in accordance 
with Chemical Abstracts indexing policies and are searchable as registered. 
Compounds having unknown or incompletely defined structures are not search-
able by substructure since there are no entries in the Structure File.
Methods for equalization of certain types of tautomers have also been
incorporated. 



2.3 The Computer Structural Record 

The structure for each compound is stored in the form of a compact list*
containing all non-H-atom connections, element symbols, bond values, H-counts
for non-carbon atoms, ring closure pairs, and various modifications such as
abnormal mass, charge, and abnormal valence. The format of this compact list,
as translated from the tape version to a printed form,* is illustrated with
annotations in Figure 2.1 for a compound requiring no special structuring con-
ventions or machine manipulations to facilitate input. Those compound types
requiring particular conventions for registration are discussed in Section
2.4 with examples and explanations of the resultant structural records.

* cf. footnote page 1.

Table 2.3: REGISTRY REPRESENTATION FOR CERTAIN
COMPOUNDS OR GROUPS



REG NO     76153
TEXT       NS

ATOM NO    1    2    3    4    5    6    7    8
CONN.           1    1    1    1    2    2    2
ELEMENT    C    C    CL   F    F    F    F    F
BOND           -1   -1   -1   -1   -1   -1   -1
H-COUNT

Figure 2.1: COMPACT LIST FOR A SIMPLE GRAPH



* A detailed description of the machine record format is available as
part of the documentation of software for the Registry System.

Description of Fields (cf. Figure 2.1)

REG NO (Registry number) - This is the machine-assigned unique
permanent address for a compound and its related information on all
CAS Registry files.

TEXT - This field carries textual information qualifying the
structural record of a compound (e.g., stereochemistry and repetition
of structural units).

ATOM NO - This field designates the machine numbers assigned to
the non-H atoms of the structure. The connection, element symbol,
bond value, and H-count (for non-carbon atoms) appear in their re-
spective fields in the column below the atom number.

CONN. - This field carries the description of atom connections
in the form of a "from list". It records the lowest numbered atom
to which each atom in the ATOM NO field is attached (ring closures
are handled separately).

ELEMENT - This field records the element symbol for each machine
numbered atom in the column beneath the appropriate atom number. Any
element except "H" can appear in this field. (Deuterium "D" and tritium
"T" are handled as separate elements.)

BOND - This field is the description of the bond between an atom
and the atom to which it is connected, as designated in the CONN. field.
Each bond description consists of two characters - the first character is
either the operator representing acyclic (chain) bonds, or the operator
representing cyclic (ring) bonds. The second character can be a
number from 1 to 5 which represents the following bond values:



1 B single 

2 B double 

3 » triple 

4 » equalized tautomer (cyclic and acyclic) 

^ B corr^letely conjugated cyclic 

H-COUNT - This field records the number of H atoms attached to 
non>carbon atoms of the structure • The H-count is placed below the 
atom number of the respective atoms. 



With the fields outlined thus far, the compact list of the structure
in Figure 2.1 would read as follows:

    Registry number 76153 is the permanent address of
    the compound, and no stereochemistry (NS) has been
    described for the structure. There are 8 atoms in
    the structure: atom 1 is a carbon atom; atom 2
    is a carbon atom connected to atom 1 by an acyclic
    single bond; atom 3 is a chlorine atom connected to
    atom 1 by an acyclic single bond; atom 4 is a fluorine
    atom connected to atom 1 by an acyclic single bond;
    atom 5 is a fluorine atom connected to atom 1 by an
    acyclic single bond; atom 6 is a fluorine atom
    connected to atom 2 by an acyclic single bond; atom 7 is
    a fluorine atom connected to atom 2 by an acyclic
    single bond; and atom 8 is a fluorine atom connected
    to atom 2 by an acyclic single bond.
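The reading of Figure 2.1 given above can be reproduced mechanically. The following is a minimal sketch in Python, not the Registry System's own software: the function and variable names are hypothetical, the field values are transcribed from Figure 2.1, and the operators "-" (acyclic) and "*" (cyclic) are taken from the bond descriptors printed in the figures.

```python
# Hypothetical names throughout; values transcribed from Figure 2.1.
ATOM_NO = [1, 2, 3, 4, 5, 6, 7, 8]
CONN = [None, 1, 1, 1, 1, 2, 2, 2]          # "from list": atom 1 has no entry
ELEMENT = ["C", "C", "CL", "F", "F", "F", "F", "F"]
BOND = [None, "-1", "-1", "-1", "-1", "-1", "-1", "-1"]

BOND_OPERATORS = {"-": "acyclic", "*": "cyclic"}
BOND_VALUES = {
    "1": "single",
    "2": "double",
    "3": "triple",
    "4": "equalized tautomer",
    "5": "completely conjugated cyclic",
}


def article(word):
    """Return 'an' before a vowel, otherwise 'a'."""
    return "an" if word[0] in "aeiou" else "a"


def describe(atom_no, conn, element, bond):
    """Spell out each column of a compact list as one sentence fragment."""
    lines = []
    for num, frm, sym, desc in zip(atom_no, conn, element, bond):
        if frm is None:
            lines.append(f"atom {num} is {sym}")
            continue
        kind = BOND_OPERATORS[desc[0]]     # first character: the operator
        value = BOND_VALUES[desc[1]]       # second character: the bond value
        lines.append(
            f"atom {num} is {sym} connected to atom {frm} "
            f"by {article(kind)} {kind} {value} bond"
        )
    return lines


for line in describe(ATOM_NO, CONN, ELEMENT, BOND):
    print(line)
```

Because every atom after the first names exactly one lower-numbered neighbor, the from list alone describes an acyclic tree; ring closures need the separate field described later in this section.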



Thus, all atoms, bonds, and connections are completely and uniquely
described for the structure in Figure 2.1. For more complex structures,
as illustrated in Figure 2.2, more fields are required to describe the
additional structural details. Figure 2.2 illustrates the treatment of
a compound containing rings. The resultant compact list contains a field
entitled RING CLOSURE PAIRS.

REG NO     5611518
TEXT       11B,16A-PREGN

ATOM NO     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
CONN.           1   1   1   1   2   2   2   3   3   4   5   5   6   7  11  11  12  14  14
ELEMENT     C   C   C   O   C   C   C   C   C   O   C   C   O   C   C   C   C   O   C   C
BOND           *1  *1  *1  -1  *1  *1  -1  *1  *1  *1  -1  -2  *1  *1  -1  -1  -1  *1  *1
H-COUNT

ATOM NO    21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
CONN.      15  18  19  19  20  22  22  23  23  23  26  28  29  31  31  31  32  37
ELEMENT     O   C   C   F   C   C   O   C   C   C   C   C   C   C   C   C   C   O
BOND       -1  -1  *1  -1  *1  -1  -2  *1  *1  -1  -1  ??  *2  -1  -1  -1  *1  -2
H-COUNT     1

RING CLOSURE PAIRS  006*1009 010*1011 015*1019 025*1028 033*1037

(?? = entry illegible in the source copy)

Figure 2.2

RING CLOSURE PAIRS - This eight-character subfield lists two
three-character atom numbers with a two-character bond descriptor
between them. This means that the two atoms cited are connected
by the cyclic bond value listed. The ring closure list for the
structure in Figure 2.2 would read as follows:

    Atom 6 is connected to atom 9 by a single cyclic
    bond; atom 10 is connected to atom 11 by a single
    cyclic bond; atom 15 is connected to atom 19 by
    a single cyclic bond; atom 25 is connected to atom
    28 by a single cyclic bond; and atom 33 is
    connected to atom 37 by a single cyclic bond.
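The eight-character layout of a ring closure pair lends itself to fixed-width slicing. The sketch below is illustrative only: the function name is hypothetical, and the sample string is the RING CLOSURE PAIRS field of Figure 2.2, with blanks assumed as the separator between subfields.

```python
def parse_ring_closures(field):
    """Split a RING CLOSURE PAIRS field into (atom, bond, atom) triples.

    Each subfield is assumed to be exactly eight characters:
    a three-character atom number, a two-character bond descriptor,
    and a second three-character atom number.
    """
    pairs = []
    for token in field.split():
        if len(token) != 8:
            raise ValueError(f"expected 8 characters, got {token!r}")
        first = int(token[:3])
        bond = token[3:5]
        second = int(token[5:])
        if bond[0] != "*":
            raise ValueError(f"ring closure bond must be cyclic: {token!r}")
        pairs.append((first, bond, second))
    return pairs


FIGURE_2_2 = "006*1009 010*1011 015*1019 025*1028 033*1037"
for first, bond, second in parse_ring_closures(FIGURE_2_2):
    print(f"atom {first} is connected to atom {second} by cyclic bond value {bond[1]}")
```

Rejecting a non-cyclic operator is a choice made for this sketch; the text states only that the bond value listed is cyclic.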

The fields ATOM NO, CONN., ELEMENT, BOND, H-COUNT, and RING CLOSURE
PAIRS collectively are termed the "graph proper" of the structural record.
All other fields are collectively termed "modifications" and are explained
below:

AB VAL (Abnormal Valence) - Table 2.1 is a table of standard
valences. For registration and searching, the valence of an atom
is defined as the sum of the H-count, the numerical value of
charges, and the value of bonds (1, 2, or 3) to that atom (e.g.,
in H3C-CH2-N=N(+)=N(-), the atom valences left to right are C=4,
C=4, N=3, N=5, N=3). The valences of atoms in a completely
conjugated cyclic system (or a tautomeric string) can be calculated
by using one of the Kekule forms (as opposed to the machine equalized
form) as the basis for determining bond values and H-count (e.g.,
for a ring bearing a -C(=O)-OH group, the Kekule form of the ring,
instead of the equalized form, would be used as the basis for
calculating valences). Valences other than those given in Table 2.1
(Standard Valences for the Elements) are recorded as modifications in
the field designated "AB VAL". For example, the abnormal valence field
of Figure 2.3 contains the following: AB VAL 003-005 025-006. This
would read: atom number 3 has a valence of 5, atom number 25 has a
valence of 6, etc.

Table 2.1: STANDARD VALENCES FOR THE ELEMENTS

Symbol  Element        Valence      Symbol  Element        Valence

Ac      Actinium          3         Mn      Manganese         7
Ag      Silver            1         Mo      Molybdenum        6
Al      Aluminum          3         N       Nitrogen          3
Am      Americium         3         Na      Sodium            1
Ar      Argon *           0         Nb      Niobium           5
As      Arsenic           3         Ne      Neon *            0
At      Astatine          1         Nd      Neodymium         3
Au      Gold              1         Ni      Nickel            2
B       Boron             3         No      Nobelium          3
Ba      Barium            2         Np      Neptunium         5
Be      Beryllium         2         O       Oxygen            2
Bi      Bismuth           3         Os      Osmium            2
Bk      Berkelium         3         P       Phosphorus        3
Br      Bromine           1         Pa      Protactinium      5
C       Carbon            4         Pb      Lead              4
Ca      Calcium           2         Pd      Palladium         2
Cd      Cadmium           2         Pm      Promethium        3
Ce      Cerium            3         Po      Polonium          2
Cf      Californium       3         Pr      Praseodymium      3
Cl      Chlorine          1         Pt      Platinum          2
Cm      Curium            3         Pu      Plutonium         4
Co      Cobalt            2         Ra      Radium            2
Cr      Chromium          6         Rb      Rubidium          1
Cs      Cesium            1         Re      Rhenium           7
Cu      Copper            2         Rh      Rhodium           2
D       Deuterium         1         Rn      Radon *           0
Dy      Dysprosium        3         Ru      Ruthenium         2
Er      Erbium            3         S       Sulfur            2
Es      Einsteinium       3         Sb      Antimony          3
Eu      Europium          3         Sc      Scandium          3
F       Fluorine          1         Se      Selenium          2
Fe      Iron              2         Si      Silicon           4
Fm      Fermium           3         Sm      Samarium          3
Fr      Francium          1         Sn      Tin               4
Ga      Gallium           3         Sr      Strontium         2
Gd      Gadolinium        3         T       Tritium           1
Ge      Germanium         4         Ta      Tantalum          5
H       Hydrogen          0         Tb      Terbium           3
He      Helium *          0         Tc      Technetium        7
Hf      Hafnium           4         Te      Tellurium         2
Hg      Mercury           2         Th      Thorium           4
Ho      Holmium           3         Ti      Titanium          4
I       Iodine            1         Tl      Thallium          1
In      Indium            3         Tm      Thulium           3
Ir      Iridium           2         U       Uranium           6
K       Potassium         1         V       Vanadium          5
Kr      Krypton *         0         W       Tungsten          6
La      Lanthanum         3         Xe      Xenon             4
Li      Lithium           1         Y       Yttrium           3
Lu      Lutetium          3         Yb      Ytterbium         3
Lw      Lawrencium        3         Zn      Zinc              2
Md      Mendelevium       3         Zr      Zirconium         4
Mg      Magnesium         2

* An oxidation number of zero (0) is used for the rare gases.
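The valence rule stated under AB VAL can be checked against the ethyl azide example with a few lines of arithmetic. This is a sketch under stated assumptions: "numerical value of charges" is read here as the absolute value of the charge (the reading that reproduces the quoted result N=3, N=5, N=3), the atoms are numbered left to right purely for illustration, and only the two Table 2.1 entries needed for the example are included.

```python
# Subset of Table 2.1 needed for this example.
STANDARD_VALENCE = {"C": 4, "N": 3}


def valence(h_count, charge, bond_values):
    """H-count + numerical value of the charge + sum of bond values."""
    return h_count + abs(charge) + sum(bond_values)


# H3C-CH2-N=N(+)=N(-): (element, H-count, charge, values of bonds to the atom)
ETHYL_AZIDE = [
    ("C", 3, 0, [1]),
    ("C", 2, 0, [1, 1]),
    ("N", 0, 0, [1, 2]),
    ("N", 0, +1, [2, 2]),
    ("N", 0, -1, [2]),
]

valences = [valence(h, q, bonds) for _, h, q, bonds in ETHYL_AZIDE]

# Atoms whose valence differs from the standard value go into AB VAL,
# written as "atom number-valence" with three digits each.
ab_val = " ".join(
    f"{number:03d}-{v:03d}"
    for number, ((element, _, _, _), v) in enumerate(zip(ETHYL_AZIDE, valences), start=1)
    if v != STANDARD_VALENCE[element]
)

print(valences)
print("AB VAL", ab_val)
```

Only the central nitrogen departs from its standard valence of 3, so it is the only atom that would need an AB VAL entry in this illustration.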

AB MASS (Abnormal Mass) - The structure registration process checks
abnormal mass citations against a list of acceptable abnormal masses
(Table 2.2). This list includes the abnormal masses most commonly
cited in chemical texts, reference works, and abstracts.

Table 2.2: TABLE OF ACCEPTABLE ABNORMAL MASS VALUES

Element Symbol    Acceptable Mass Values

Au                195, 198, 199
Br                77, 79, 81, 82
C                 11, 13,
Ca                45, 47
Cl                36, 38
Co                56, 57, 58, 60
Cr                51
I                 124, 125, 129, 131, 132
K                 42
N                 15
Na                24
O                 17, 18
P                 32
S                 35
Sr                90
REG NO     10380162
TEXT       CIS

ATOM NO     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
CONN.           1   1   2   3   3   3   4   5   8  10  11  12  13  14  15  16
ELEMENT     C   C   N   C   C   C   C   C   C   C   C   C   C   C   C   C   C
BOND           -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1
H-COUNT

ATOM NO    18  19  20  21  22  23  24  25  26  27  28  29  30
CONN.      17  18  19  20  21  22      24  24  25  25  25  26
ELEMENT     C   C   C   C   C   C   O   S   C   O   O   O   C
BOND       -1  -1  -1  -1  -1  -1      -1  -1  -1  -2  -2  -1
H-COUNT

AB VAL     003-005 025-006
CHARGE     003,+1 027,-1
M.A.F.     024,01/001
           (024 = beginning atom of 2nd fragment;
            01/001 = ratio to 1st fragment)

Figure 2.3: COMPACT LIST FOR A GRAPH CONTAINING A MULTI-ATOM FRAGMENT
Acceptable abnormal masses are recorded as modifications in the field
designated "AB MASS". For example, the abnormal mass field in Figure
2.4 contains the following: AB MASS 001-032. This would read:
atom number 1 has a mass of 32.

CHARGE - Charges on atoms are recorded as modifications to the
graph proper in a field designated "CHARGE". Charges in the range
-9 through +9 are acceptable, and they must be linked to specific
atoms in the machine record. (See conventions for coordination
compounds in Section 2.4.) For example, the charge field of Figure 2.3
contains the following: CHARGE 003,+1 027,-1. This would read:
atom number 3 has a charge of +1, atom number 27 has a charge of -1,
etc.

S.A.F. (Single-Atom-Fragment) - See dot-disconnected conventions
in Section 2.4.

M.A.F. (Multi-Atom-Fragment) - See dot-disconnected conventions
in Section 2.4.

TAUTOMER ENDPOINTS - See tautomer description in Section 2.5.
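The AB VAL and CHARGE modification fields quoted above follow simple "atom number, value" patterns and can be read with regular expressions. The sketch below is an illustration, not the Registry software: the function names are hypothetical, the sample strings are those quoted for Figure 2.3, and optional blanks around the comma are tolerated as an assumption about the printed form.

```python
import re


def parse_ab_val(field):
    """Map atom number -> abnormal valence, e.g. '003-005' -> {3: 5}."""
    return {int(atom): int(val) for atom, val in re.findall(r"(\d{3})-(\d{3})", field)}


def parse_charge(field):
    """Map atom number -> signed charge, e.g. '003,+1' -> {3: 1}."""
    charges = {}
    for atom, charge in re.findall(r"(\d{3})\s*,\s*([+-]\d)", field):
        value = int(charge)
        # The text states that charges from -9 through +9 are acceptable;
        # a single signed digit can never fall outside that range.
        assert -9 <= value <= 9
        charges[int(atom)] = value
    return charges


print(parse_ab_val("003-005 025-006"))
print(parse_charge("003,+1 027,-1"))
```

Keeping each modification as a mapping keyed by atom number mirrors the requirement that charges "must be linked to specific atoms in the machine record."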

2.4 Structuring Conventions and Resultant Computer Records 

The structuring conventions employed to input structural diagrams must 
result in consistently generated compact lists to insure unique registration 
and reliable retrieval of structural information. This section provides 
information on those conventions which must be considered when phrasing 
substructure search questions. 
Table 2.3: REGISTRY REPRESENTATION FOR CERTAIN
COMPOUNDS OR GROUPS

(Charges are written in parentheses after the atom that carries them.)

Name              Formula                        Registry Representation

Azide             N3 (in NaN3)                   N(-)=N(+)=N(-)
Azido             -N3 (in RN3)                   -N=N(+)=N(-)
Carbon monoxide   CO                             C(-)≡O(+)
Diazo             =N2 (in CH2N2)                 =N(+)=N(-)
Diazonium         -N2 (in RN2(+) Cl(-))          -N(+)≡N
Fulminic acid     HONC                           HO-N(+)≡C(-)
Isonitrile        -NC (in R-NC)                  -N(+)≡C(-)
Nitrate           NO3 (in NO3(-), RONO2)         [structural diagram not recoverable]
Nitric acid       HNO3                           [structural diagram not recoverable]
Nitrite           NO2 (in NO2(-), RONO)          [structural diagram not recoverable]
Nitro             -NO2                           [structural diagram not recoverable]
Nitroxide         NO (in R2NO)                   >N-O•
Perchlorate       ClO4 (in ClO4(-), ROClO3)      [structural diagram not recoverable]
Phosphate         PO4 (in PO4(3-), R3PO4)        [structural diagram not recoverable]
Phosphoric acid   H3PO4                          [structural diagram not recoverable]
Sulfate           SO4 (in SO4(2-), R2SO4)        [structural diagram not recoverable]
Sulfite           SO3 (in SO3(2-), R2SO3)        [structural diagram not recoverable]
Sulfuric acid     H2SO4                          [structural diagram not recoverable]
Rigorously Defined Structures - Certain compounds and functional groups
have been rigorously defined to insure consistent registration and retrieval.
A number of these are listed in Table 2.3 with their registry representations.

Dot-Disconnected Conventions - Amine salts, hydrates, and π-complexes
are, in general, handled as addition compounds. The interconnections between
the structural formulas of the various components are shown by a dot (e.g.,
CH3NH2·H2SO4). Metal salts of oxy acids (carboxylic, sulfonic, etc.) and metal
derivatives of many other functional groups containing -OH, -SH, -SeH, -TeH,
and -NH are handled by a similar dot-disconnected convention.* That is, the
structural formula of the acid (or other compound) is followed in turn by a
dot, a coefficient depicting the ratio to the first fragment, and the metal
(e.g., CH3COOH·1/2Ca; CH3OH·Na).

Table 2.4 contains the list of elements defined as nonmetals for the
purposes of registration and searching.



Table 2.4: LIST OF NONMETALS

Antimony      Carbon        Krypton       Selenium
Argon         Chlorine      Neon          Silicon
Arsenic       Fluorine      Nitrogen      Sulfur
Astatine      Helium        Oxygen        Tellurium
Boron         Hydrogen      Phosphorus    Xenon
Bromine       Iodine        Radon

* Except when a multivalent metal atom is in a ring or unsymmetrical
structure. [The example structural diagram is not recoverable.]
Metal salts or derivatives of organic compounds containing functional
groups of the nonmetals other than N, O, S, Se, and Te are structured and
stored using bond connections (i.e., they are not structured by the
dot-disconnected convention). Metal derivatives of strictly inorganic hydrides
of all nonmetals (including O, S, Se, and Te) are registered manually
and, therefore, are not searchable by structure (e.g., NaOH, LiNH2).

M.A.F. (Multi-Atom-Fragment) - Figure 2.3 gives an example of a
dot-disconnected structure in which both fragments contain more than one
non-H atom. Such fragments are recorded as part of the graph proper with
discontinuities marking the separation of fragments. The ratios of
multi-atom-fragments to the first fragment of the structure are entered as a
modification in the Multi-Atom-Fragment (M.A.F.) field. Thus, the M.A.F.
field of Figure 2.3 would read: the fragment beginning with atom 24 in
the graph proper has a ratio of 1 to 1 with the first fragment of the graph
proper.

S.A.F. (Single-Atom-Fragment) - Figure 2.4 gives an example of a compact
list representing a dot-disconnected structure in which one portion of the
structure is a single atom of the metal sodium. When one fragment of a
dot-disconnected structure contains only one non-H atom or a non-H atom with
attached H (e.g., H2O, NH3, and ⁻OH), no indication of this fragment is given
in the graph proper. Instead, a special Single-Atom-Fragment (S.A.F.) field
is set up as a modification designating the single atom of the fragment
with its H-count, valence, abnormal mass, charge, and ratio to the first
fragment of the structure. Thus, the S.A.F. field of Figure 2.4 would
read: there is a single-atom-fragment consisting of Na with a H-count of
000, a valence of 001, a normal mass (indicated by 000 in this position),
a charge of +0, and a ratio of 2 to 1 to the first fragment of the graph
proper.

* Additional details on the treatment of metal salts and other metal
derivatives are available in "Chemical Abstracts Service Chemical
Compound Registry - Structure Conventions," January 31, 1968.

Inorganic Compounds - Those compounds containing no carbon in which
the standard valence of each atom is filled, but not exceeded, are
structured using the same covalent structuring conventions used for organic
compounds (e.g., Cl-P(Cl)-Cl and Na-Cl). Those inorganic compounds in which
the number of connections to a central metal atom exceeds the value of the
oxidation state† for that atom are structured using the conventions for
coordination compounds outlined below. Metal salts of inorganic oxo acids
(and their S, Se, and Te analogs) are structured using the dot-disconnected
convention (e.g., HO-P(=O)(OH)-OH · 3/2 Ca).

Coordination Compounds - The structuring conventions for metal
coordination compounds are designed for consistency in structural
representation and consistency in nomenclature. The system which has been
adopted indicates charges on certain classes of ligands,† and also indicates
the oxidation state of the central atom as the appropriate charge associated
with that atom (this latter feature provides an efficient means of searching
for coordination compounds as a class and searching for coordination compounds
of a specific element: a metal atom with a non-H connection and an
indicated charge).

† See Glossary for definition.

The following are some specific coordination compound conventions:*

1. An acyclic single bond is to be used to show the connection
   between the metal and the group connected (ligand).

2. Stereochemistry is expressed by the usual text descriptors, +,
   syn, etc. Additional coordination compound text descriptors
   for the appropriate isomerism are: antiprismatic, bipyramidal,
   dodecahedral, planar, prismatic, pyramidal, octahedral, and
   tetrahedral.

3. Charges are to be shown on the metal and the appropriate atom(s)
   of each "ionic" ligand. [The example diagram, an iron complex with
   cyanide ligands, is not recoverable.]

4. A hydrogen ion (H+) or a metal ion (such as Na+, Mg+2, K+)
   involved as a companion cation to a metal coordination anion is
   structured as such, in a dot-disconnected form, with the positive
   charge(s) indicated.

5. Coordination compounds involving charge-delocalized ligands are
   not presently input by structure, except for the anions of
   β-diketones and β-keto esters (and their thio and seleno analogs).
   These latter ligands usually form chelate rings with the metal
   using both oxygen (S, Se) atoms and ionizing a hydrogen atom
   from the carbon in between the oxygen atoms of the carbonyl
   groups. Therefore, a negative charge is shown on the carbon atom
   between the carbonyl groups, and appropriate positive charge(s)
   are shown on the metal.

6. The carbonyl group (CO), as a unidentate† group, is represented
   with a carbon-oxygen triple bond (C≡O). The carbonyl group as a
   bridging ligand is represented as >C=O.

7. The nitrosyl group (NO) is represented with a nitrogen-oxygen
   triple bond and positive charge on the oxygen: N≡O(+).

8. Anions of α-amino (hydroxy, mercapto) acids and β-amino (hydroxy,
   mercapto) acids usually form chelate rings with the metal (ionizing
   the hydrogen of the CO2H group). The bonding is through the N (O,
   S) and the -O(-) portion of the carboxylate group. However, such
   compounds involving the metals Ba, Ca, Cs, Fr, K, Li, Na, Ra, Rb,
   or Sr are structured by dot-disconnected conventions instead of
   coordination compound conventions.

* For more detailed explanation and examples of structuring, see footnote on
page 25.

† See Glossary for definition.

Stereochemistry - Stereochemical information is recorded by text
descriptors in the modification field "TEXT". The Substructure Search System
retrieves all structures which satisfy the constitutional (two-dimensional)
requirements of a substructure. The chemist then screens this set of potential
answers for those which satisfy the stereochemical requirements as well.
Text descriptors have been developed for the main body of steroids,
terpenes, alkaloids, and carbohydrates based upon an alphabetic term, or
base name, representing a basic parent structure with implied stereochemistry
at specified positions. This base name is the name of the parent structure
or a term closely related to it.

2.5 Mechanized Treatment of Tautomers and Completely Conjugated Cyclics

Tautomers - The Registry System will provide programmed identification
of certain types of unique compounds that can be represented by two
different, but equally valid structural diagrams. A generalized representation
is:

    M=Q-ZH ⇌ HM-Q=Z

where M, Q, and Z are combinations of As, Br, C, Cl, I, N, O, P, S, Sb, Se,
and Te (including abnormal mass analogs), "=" represents a double bond,
"-" represents a single bond, and H must be present as shown. The following
are examples of the types of structures involved:

a) General representation M=Q-ZH
   M and Z are N (trivalent)
   Q is C
   [example structural diagrams not recoverable]

b) General expression M=Q-ZH
   M, Q, and Z are N (trivalent)
   [example structural diagrams not recoverable]

c) General expression M=Q-ZH
   M and Z are O, S, Se, or Te (or abnormal mass analogs)
   Q is N, P, As, or Sb (tri- or pentavalent), S or Se (tetra-
   or hexavalent), Cl, Br, or I (any valence), or C (tetravalent)
   [example structural diagrams not recoverable]
These general expressions cover those forms which differ from each
other in the location of a mobile hydrogen atom on the endpoints of a
three atom string. The bonds involved in the string must both be either
cyclic or acyclic.

When a tautomer string is recognized by program, the bonds in the
string are assigned a bond value of 4 together with the appropriate bond
operator. The hydrogen atom being shared by the endpoints of the string
is not recorded in the H-COUNT field of the graph proper, but the endpoints
of the string are recorded in a modification field designated TAUTOMER
ENDPOINTS. It is understood that the two atoms in each of these tautomer
endpoint pairs are sharing one hydrogen atom. Deuterium and tritium are
not treated as hydrogen atoms for the purposes of tautomerism and must be
accounted for in searching when they are acceptable substitutes for normal
hydrogen.

Completely Conjugated Cyclics - The Registry System accepts input of
alternating double and single bonds in cyclic structures. That is, no
specific indication of aromaticity, other than conventional bonding, is
made at input. In the registration process, programs identify those bonds
which are part of a completely conjugated cyclic system and equalize them
(i.e., all bonds in such a system are given the same bond value). Such
bonds are represented by the descriptor "*5" wherever they appear in the
stored structure record. Whenever a completely conjugated cyclic bond so
identified by program is also in a tautomer string recognized by program,
that bond will be represented in all stored records by the descriptor "*5".
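The tautomer treatment described above (find an M=Q-ZH string, set both bonds to value 4, drop the shared hydrogen from the H-count, and record the endpoints) can be sketched as follows. This is an illustration only and not the Registry System's program: all names are hypothetical, only the three example classes a) to c) are tested for, the valence restrictions on Q are ignored, overlapping strings are not handled, and the rule that a completely conjugated cyclic bond keeps "*5" is omitted.

```python
CHALCOGENS = {"O", "S", "SE", "TE"}
CLASS_C_CENTRES = {"N", "P", "AS", "SB", "S", "SE", "CL", "BR", "I", "C"}


def is_candidate(m, q, z):
    """True for the element patterns of example classes a), b) and c)."""
    if m == "N" and z == "N" and q in {"C", "N"}:
        return True
    return m in CHALCOGENS and z in CHALCOGENS and q in CLASS_C_CENTRES


def equalize_tautomers(elements, h_count, bonds):
    """Return (bonds, h_count, endpoints) after tautomer equalization.

    elements: {atom number: symbol}
    h_count:  {atom number: attached H atoms}
    bonds:    {(low atom, high atom): two-character descriptor such as '-2'}
    """
    new_bonds, new_h, endpoints = dict(bonds), dict(h_count), []
    neighbours = {}
    for (a, b), desc in bonds.items():
        neighbours.setdefault(a, []).append((b, desc))
        neighbours.setdefault(b, []).append((a, desc))
    for q, attached in neighbours.items():
        for m, m_bond in attached:
            for z, z_bond in attached:
                if m == z or m_bond[1] != "2" or z_bond[1] != "1":
                    continue            # need M=Q and Q-Z
                if m_bond[0] != z_bond[0]:
                    continue            # both cyclic or both acyclic
                if h_count.get(z, 0) < 1:
                    continue            # the mobile H must be present on Z
                if not is_candidate(elements[m], elements[q], elements[z]):
                    continue
                for end in (m, z):
                    new_bonds[(min(q, end), max(q, end))] = m_bond[0] + "4"
                new_h[z] -= 1           # shared H is not kept in H-COUNT
                endpoints.append((min(m, z), max(m, z)))
    return new_bonds, new_h, endpoints


# CH3-C(=O)-SH, numbered arbitrarily: 1 C, 2 C, 3 O, 4 S
elements = {1: "C", 2: "C", 3: "O", 4: "S"}
result = equalize_tautomers(
    elements, {4: 1}, {(1, 2): "-1", (2, 3): "-2", (2, 4): "-1"}
)
print(result)
```

After equalization the thioacid and its tautomer CH3-C(OH)=S yield the same record, which is the purpose of the convention.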
The Chemical Compound Registry

by 

Margaret K. Park 

Chemical Abstracts Service, The Ohio State University, Columbus, Ohio 

Why a Computer-Based Compound Handling System

The unifying theme of the chemical literature has been the emphasis
on detailed understanding of the structural characteristics of chemical
compounds and materials. The primary literature of many natural and
physical sciences often has a relatively short, useful life, but even
today it is not uncommon for a working chemist to go back 60 years or
more in the literature to find immediately useful recorded data.
Significantly, that portion of the chemical literature that has a long useful
life includes data associated with substances having identified structure
and/or composition. The indexes of the secondary publication in
chemistry make access to the compound data readily available. In this respect,
the 3 to 4 million compounds already recorded in the past volumes of
Chemical Abstracts and Beilsteins Handbuch der Organischen Chemie
represent a valuable collection of information to the chemical community.



Presented at a Joint Meeting of the Special Libraries Association Chemistry 
Division and the American Chemical Society Chemical Literature Division. 

May 4, 1968. 
Knowledge of molecular structure provides a wide range of information
about chemical reaction, physical properties, and biological activity.
And, with understanding of structural chemistry rapidly growing in
importance, chemical problems are increasingly being approached from the
structural viewpoint. However, at the present time, many potentially
valuable correlations between structural features and chemical and
physical properties are not attempted because it would require too much human
effort to select and compare the required features in large data
collections. Moreover, existing systems in the traditional form of books or
printed files cannot be easily reoriented to meet new needs. To combat
these problems, computer-based methods are being called into play. The
versatility of the computer in organizing and correlating the data makes
possible greater flexibility in use of the files, and thus ensures a
long useful life for the stored information.

Background on the Choice of a Notation 

A prime concern of Chemical Abstracts Service in developing its 
computer-based compound handling systems is to design a system sufficiently 
complex in handling detailed data to provide good service to the chemical 
community yet simple enough in operation to be economically feasible. 

The extremely large size of the operation cannot be overemphasized
as an important factor in the design of the system. The Registry System
has been in operation since 1965 and now has computer files of over
900,000 different chemical compounds and their associated one million
names and 1.7 million bibliographic citations. These 900,000 different
compounds represent the processing of 1.75 million references to
compounds occurring in the literature. The Registry System has identified
about half of the scattered references as being duplicates of compounds
already on file. Some 4,000 new compounds are added to the files each
week; reference citations to the compounds appearing in the current
literature are being added at the rate of 9,000 per week.



Such a sizeable collection must be compatible with other collections
of compound-related data, regardless of their form, if it is to be
useful. For this reason, CAS has, as a service to the chemical community,
designed a system large enough to handle the full range of chemical
literature, detailed enough in its own computer records to provide for
multiple uses of the information, and flexible enough to accommodate the
wide variety of users who depend on these services.

Purpose and Organization of the Registry System 

The Registry System is a computer-based identification system which
uniquely identifies chemical compounds on the basis of their structural
diagrams. The heart of this system is a computer program which generates
a unique notation for each different compound recorded in the Registry
files. Each machine notation is a detailed description of the
two-dimensional graph of a chemical compound in an atom-by-atom, bond-by-bond
listing. This chemical structure notation record is consistent with the
amount of detail used by most working chemists. Further, it can be
adapted to contain more detail, such as bond length or bond angles, should
this be desirable, and it can also be adapted to provide lesser detail
for interfacing with systems that do not use atom-by-atom records.

There are at the present time three principal files of data contained
in the Chemical Compound Registry System: the atom-and-bond records just
described in the structure file; a file of various forms of nomenclature
that have been associated with these compounds; and bibliographic
citations to the CAS publications such as Chemical Abstracts (CA) and
Chemical-Biological Activities (CBAC). Registration of compounds from CA
began on a routine basis with CA volume 62 (January-June 1965) and has
continued since that time. In addition, the fluorine-containing organic
compounds indexed in all volumes of CA from volume 1 to the present and
in Beilsteins Handbuch der Organischen Chemie have been registered. Also
included in these files are compound data from many well known reference
works such as the Merck Index, the Colour Index, some sections of the
Code of Federal Regulations, United States Adopted Names, and several
other references. The information in the three Registry files (and indeed
in all other CAS computer files that contain compound-related data) is
tied together by Registry Numbers. A Registry Number is assigned to each
structure on the basis of its unique notation when the structure is first
entered into the file. Whenever a structure which is already present on
the structure file appears in a new source, the previously assigned
number is recovered automatically. The Registry Number functions as the
machine address within the associated files of structure, nomenclature,
bibliographic data, index entries, etc. The number is not itself a
notation that can be translated to the structural record.
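The behavior just described (a number is assigned when a structure is first entered, and recovered automatically when the same unique notation appears again) amounts to a lookup keyed on the notation. The sketch below illustrates that idea only; the class and method names are hypothetical, the notations are placeholder strings, and sequential numbering is an assumption made for the example.

```python
class Registry:
    """Toy registry: unique notation -> Registry Number."""

    def __init__(self, first_number=1):
        self._by_notation = {}
        self._next = first_number

    def register(self, notation):
        """Return (registry_number, is_new) for a unique notation."""
        if notation in self._by_notation:
            # Structure already on file: recover the previously assigned number.
            return self._by_notation[notation], False
        number = self._next
        self._next += 1
        self._by_notation[notation] = number
        return number, True


registry = Registry()
first = registry.register("notation-A")    # new structure
second = registry.register("notation-B")   # another new structure
again = registry.register("notation-A")    # duplicate reference
print(first, second, again)
```

The number serves only as an address: nothing in it encodes the structure, which is why the lookup must go from notation to number and not the reverse.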

Initially, the system was designed to handle carbon-containing
organic compounds for which a fully defined two-dimensional structure
could be drawn. Since that time, research and development work on both
the chemical aspects of the Registry System and on the computer
equipment and programming aspects has made it possible to extend the
registration process to mixtures, inorganic compounds, metal coordination
compounds, and to some classes of polymers and partially described structures.

Computer Requirements for Notation 

Because each compound filed in the CAS Registry System is assigned
its own unique machine-readable notation, it is appropriate at this
point to review the requirements for a structure notation. Webster
defines a notation as "a system of characters, symbols, or abbreviated
expressions used in an art or science to express technical facts,
quantities, or other data". In an extensive information system, the processes
of identification, filing, and retrieving compound-related data should be
handled automatically by computer processes as much as possible. It is
particularly important that the CAS system, which may eventually include
several million chemical compounds in one file, identify synonymy between
alternate representations of compounds. The ultimate size of the file
alone dictates the desirability of eliminating unnecessary, redundant
information. But more important, the efficient and economic retrieval
of information requires the ability to identify synonymy, whether it is
for the purpose of collecting all data at a single point in listings such
as the CA Subject Indexes or for retrieving data via a computer search.
Effective identification of synonymy requires a unique representation
for each compound in the system.



The notation for CAS use must also be a stable representation.
Because it is to be used in a large, production-oriented system, the rules
governing the notation must be under the control of the operating organization
to assure the integrity of the data files. An evolutionary development,
such as occurs with nomenclature, causes continual updating of the
record as new rules are applied. For large files it is not economically
feasible to continually update the collection.

The system must also be highly reliable: it must produce the same
notation for the same compound no matter what the variables in input.
It must be "fail safe" against keyboarding errors and inconsistencies in
input. The notation should be concise and not redundant, consistent with
the level of detail stored. At the same time, it should be comprehensive
enough to cover all of chemistry and all classes of compounds. Finally,
the notation must be flexible enough to interface with the various techniques
used throughout the chemical community; useful in a variety of applications;
and economical to generate, store, and utilize in the large-scale operational
environment.

CAS Machine Notation

The CAS notation is a detailed inventory of the atom and bond components
of the structural diagram which are computer-arranged into a unique cipher.
This inventory, usually called a "connection table," is an ordered listing
of the atoms and bonds of the drawing and the manner in which
they are connected.
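
As an illustration only, a connection table can be sketched as an atom list plus a bond list. The field names, the implicit-hydrogen convention, and the ethanol example below are assumptions made for this sketch; they are not the CAS record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    a: int      # index of the first atom in the atom list
    b: int      # index of the second atom in the atom list
    order: int  # 1 = single, 2 = double, 3 = triple


# Hypothetical table for ethanol (CH3-CH2-OH); hydrogens are left implicit.
atoms = ["C", "C", "O"]
bonds = [Bond(0, 1, 1), Bond(1, 2, 1)]


def neighbors(atom_index, bonds):
    """Return the sorted indices of the atoms bonded to atom_index."""
    found = []
    for bond in bonds:
        if bond.a == atom_index:
            found.append(bond.b)
        elif bond.b == atom_index:
            found.append(bond.a)
    return sorted(found)


print(neighbors(1, bonds))  # [0, 2]
```

Every later operation described in this paper (editing, unique ordering, searching) can be expressed as a walk over such a list of atoms and bonds.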

The computer program that generates the unique form of the cipher is 
the very heart of this system. This program was developed by Morgan of 
CAS based on the work of Gluck at DuPont. It has been mathematically 
verified to assure that it does, in fact, always result in the same 
notation for a given compound. Moreover, this technique also handles any
graph of points connected with lines; therefore, all chemical compounds
discovered in the future can be converted to machine notations with no 
modifications of the existing rules. 
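
The text above does not describe the procedure itself. The sketch below shows only the "extended connectivity" idea from Morgan's published description, in simplified form: each atom starts with its number of neighbors, the values are repeatedly replaced by the sum of the neighbors' values, and iteration stops when the number of distinct values no longer grows. Tie-breaking by element and bond type, and the final numbering step, are omitted; the function name and the dictionary input format are assumptions.

```python
def extended_connectivity(adjacency):
    """adjacency maps each node to the list of its neighbors.

    Returns a dict of node -> extended connectivity value. The values
    depend only on the shape of the graph, not on how nodes are labeled.
    """
    values = {node: len(nbrs) for node, nbrs in adjacency.items()}
    distinct = len(set(values.values()))
    while True:
        new_values = {
            node: sum(values[m] for m in nbrs)
            for node, nbrs in adjacency.items()
        }
        new_distinct = len(set(new_values.values()))
        if new_distinct <= distinct:
            return values  # no further discrimination gained
        values, distinct = new_values, new_distinct


# Carbon skeleton of n-pentane, drawn in two different input orders.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
relabeled = {"a": ["c"], "c": ["a", "e"], "e": ["c", "b"],
             "b": ["e", "d"], "d": ["b"]}

print(extended_connectivity(pentane))  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
```

Both input orders give the same multiset of values, which is the property a unique ordering can be built on.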



Stereochemistry 

Third-dimensional features are recorded in Registry files by conventional
stereochemical descriptors such as erythro and threo, D and L.
These features are determined from the original document by the chemist
who prepares the structural diagram. Methods for recording node-by-node
stereoisomerism within the connection table itself have been described by
Petrarca, Rush, and Lynch. These techniques are being evaluated to determine
their economic feasibility in relation to the number of compounds
in the literature that provide stereochemical detail sufficient to warrant
their use.

Methods of Input - Data

Alternative methods of input to the structure file help to build a 
flexible registration and filing system. For greatest efficiency, the
system must input data in many forms and be adaptable to the use of many
types of equipment. All processing programs in the Registry System, therefore,
are intentionally designed to operate independently of the input
description of the compound's structure and of the equipment which processes
the data into machine-readable form.



Connection Table 



The structural records can be input using a variety of data records.
Structural diagrams, such as those typed on structure typewriters, are
input to the Registry System. Atom-bond connection tables can be entered
directly into the system. Algorithms for translating systematic nomenclature
and other forms of notations have been described in the literature.
Regardless of which form is used, the result is an atom-bond listing of
the structural diagram.



Nomenclature Translation



Computer translation of systematic nomenclature to the connection
table record should provide an economical method for adding to the computer
files the three to four million compounds that appeared in the indexes
prior to 1965. Up to now, almost all of the compounds in the Registry
files have been input via structural diagrams which were hand drawn by
professional chemists and translated into machine language through clerical
effort. The task of registering the pre-1965 material, however, will
not be feasible, in terms of time, manpower, or dollars, if it is to require
the preparation of a structural diagram for each compound and the
subsequent conversion of that structural diagram to machine language for
registration. Therefore, CAS Research and Development staff has concentrated
on the development of translation procedures which permit
direct computer translation from nomenclature to connection tables for
use in automatic registration. This work, called Nomenclature Translation,
has resulted in an algorithm or set of procedures that will allow the
handling of an estimated 60% to 70% of the names of organic compounds in
current and past CA Formula Indexes. Programming is now well under way
at CAS.

Methods of Input - Equipment 

A wide variety of equipment can be used to convert structural data
to machine-readable form. Keyboarding can be done on devices such as the
standard keypunch, paper-tape-generating typewriter, and on various magnetic
tape recording devices. Structure typewriters, such as those developed by
Mullen and Feldman, are also in use. A modified version of the Mullen
typewriter is presently being used in CAS operations for much of the routine
input to the Registry. These devices have all been used within CAS operations,
often simultaneously.

Optical scanning equipment has been developed by Badische Anilin- und 
Soda-Fabrik to automatically scan structures into the computer and convert 
them to connection tables. CAS has explored optical scanning, but the
equipment required to support our high-volume input is not available.

The form of input chosen and the equipment used is determined by the
use to which the data is to be put. The large-scale input required at CAS
dictates that the method chosen must be quick, efficient, and precise.
Such a concentrated input of compounds from the full range of chemical
literature justifies special purpose equipment, such as structure-typing
typewriters, which are not needed for smaller operations. However, CAS
will always maintain the flexibility to accommodate a wide range of equipment
and input forms.

Automatic Editing 

To maintain a store of accurate information in the Registry file, a 
computer editing program has been developed which automatically detects 
errors introduced during the structuring and keyboarding operations. The 
detailed connection table allows a relatively simple and straightforward
approach to editing this notation. Two important characteristics of the
record are used for the programmed edit checks: (1) the high degree of
redundancy CAS has imposed on the input conventions, and (2) the syntax
of the chemistry inherent in the notation.

Using these tools, the editing program applies some 50 different 
checks to the connection table records. Each atom is checked to verify 
that it is a valid atom element symbol and that the value of the bonds 
attached is equal to a stored value for that element. Uncommon element 
valences, such as those that occur in free radicals and carbenes, are 
always recorded in the input data. Additional editing checks have been 
added to the 360 program to check the valence and charge of metal ions as
well as the validity of the coordination number (i.e., the number of ligand
attachments). The atomic mass of isotopically labeled elements is compared
with a table of permissible values. Similar validity checks
have also been incorporated for frequently occurring stereochemical
descriptors.

From the information given in the table, the program also computes
a molecular formula, including the hydrogen count, and compares it to
that calculated by the chemist and input by the keyboard operator. Any
discrepancy between the two molecular formulas constitutes an error.

When any error is detected by the edit checks, the table is barred from 
further Registry processing, and the table is rejected with an appropriate 
diagnostic message. After being corrected the connection table is again 
recycled through the edit program to assure that no new errors have been 
introduced. 
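
Two of the checks described above, the valence check and the molecular formula comparison, can be sketched as follows. This is a minimal illustration, not the CAS edit program: the valence table is a small assumed subset, hydrogens are assumed to be implicit and to fill any remaining valence, and the unusual valences that the text says are recorded in the input data are not handled.

```python
from collections import Counter

# Assumed normal valences for a few elements (illustrative subset only).
VALENCE = {"C": 4, "N": 3, "O": 2, "Cl": 1}


def edit_check(atoms, bonds, keyed_formula):
    """Return a list of diagnostic messages; an empty list means "accepted".

    atoms: list of element symbols.
    bonds: list of (i, j, order) tuples.
    keyed_formula: dict of element -> count, as keyed by the operator.
    """
    errors = []
    bond_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_sum[i] += order
        bond_sum[j] += order

    counts = Counter()
    for index, symbol in enumerate(atoms):
        if symbol not in VALENCE:
            errors.append(f"atom {index}: unknown element symbol {symbol!r}")
            continue
        counts[symbol] += 1
        missing = VALENCE[symbol] - bond_sum[index]
        if missing < 0:
            errors.append(f"atom {index}: valence of {symbol} exceeded")
        else:
            counts["H"] += missing  # implicit hydrogens fill the valence

    computed = {element: n for element, n in counts.items() if n > 0}
    if not errors and computed != keyed_formula:
        errors.append(f"formula mismatch: computed {computed}, keyed {keyed_formula}")
    return errors


# Ethanol keyed correctly, then with a miscounted hydrogen.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]
print(edit_check(ethanol_atoms, ethanol_bonds, {"C": 2, "H": 6, "O": 1}))  # []
print(len(edit_check(ethanol_atoms, ethanol_bonds, {"C": 2, "H": 5, "O": 1})))  # 1
```

A table that returns any message would be held back and recycled through the same function after correction.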

The editing program also includes features that detect alternate
structural representations for certain classes of compounds and convert
the different forms to consistent, unambiguous descriptions for subsequent
generation of the unique notation. One such class of compounds is the
conjugated ring system, such as the resonance structures of the dichlorobenzene
derivatives. The alternating single and double bonds enclosed
in any cyclic system are identified, and the bonds recoded as normalized,
equivalent bonds. Automatic identification of such conjugated ring systems
equates the alternative input records and provides an unambiguous description
for the compound in the computer record yet allows the chemist to
draw the diagram in the conventional manner. In a similar manner, tautomeric
systems are also automatically recognized and recoded. Harmonization
(normalization) of tautomeric compounds has been deliberately limited to
types of tautomeric structural units found in such compound classes as
the amidines, thiocarboxylic, phosphoric and arsonic acids, guanidines,
and imidazoles. An illustration of these types is the benzimidazole
structures on the slide. Keto-enol tautomers are not equalized in this
manner, although the computer routine that identifies these structural
conditions is a general one that can be extended to handle this and
other tautomeric classes if desired.
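
The ring normalization described above can be illustrated with the two resonance drawings of o-dichlorobenzene. In this sketch the ring is supplied explicitly (ring detection is left out), the bond code "N" for a normalized bond is an assumed label, and tautomer handling is not attempted.

```python
def normalize_ring(bonds, ring):
    """Recode an alternating single/double ring as normalized bonds.

    bonds: dict mapping frozenset({i, j}) -> bond order (1 or 2).
    ring: atom indices listed in cyclic order.
    Returns a new dict; non-alternating rings are returned unchanged.
    """
    size = len(ring)
    ring_bonds = [frozenset((ring[k], ring[(k + 1) % size])) for k in range(size)]
    orders = [bonds[b] for b in ring_bonds]
    alternating = (
        size % 2 == 0
        and set(orders) == {1, 2}
        and all(orders[k] != orders[(k + 1) % size] for k in range(size))
    )
    result = dict(bonds)
    if alternating:
        for b in ring_bonds:
            result[b] = "N"  # assumed code for a normalized bond
    return result


def dichlorobenzene(double_bonds):
    """Six-membered ring (atoms 0-5) with Cl on ring atoms 0 and 1."""
    bonds = {frozenset((k, (k + 1) % 6)): 1 for k in range(6)}
    for i, j in double_bonds:
        bonds[frozenset((i, j))] = 2
    bonds[frozenset((0, 6))] = 1  # chlorine attached to ring atom 0
    bonds[frozenset((1, 7))] = 1  # chlorine attached to ring atom 1
    return bonds


ring = [0, 1, 2, 3, 4, 5]
form_a = dichlorobenzene([(0, 1), (2, 3), (4, 5)])
form_b = dichlorobenzene([(1, 2), (3, 4), (5, 0)])

print(form_a == form_b)                                        # False
print(normalize_ring(form_a, ring) == normalize_ring(form_b, ring))  # True
```

The two drawings differ as keyed but become identical records once the ring bonds are recoded, so they lead to the same unique notation.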

Another feature of the editing program adds greater specificity to 
the input record of the structure. This program identifies and differen- 
tiates acyclic and cyclic bonds and assigns the appropriate code to each. 
This additional detail provides improved discrimination in the retrieval 
of compounds on the basis of structural characteristics. 
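
One simple way to tell cyclic from acyclic bonds, offered here as a sketch and not as the method used in the CAS program, is to remove each bond in turn and test whether its two atoms are still connected. If they are, the bond lies on a ring. The function name and the tuple format for bonds are assumptions.

```python
def classify_bonds(n_atoms, bonds):
    """Label each bond 'cyclic' or 'acyclic'.

    bonds: list of (i, j) atom index pairs.
    Returns a list of labels in the same order as bonds.
    """
    labels = []
    for skip, (start, goal) in enumerate(bonds):
        # Adjacency lists built from every bond except the one under test.
        adjacency = {k: [] for k in range(n_atoms)}
        for index, (a, b) in enumerate(bonds):
            if index != skip:
                adjacency[a].append(b)
                adjacency[b].append(a)
        # Depth-first search from one end of the removed bond.
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        labels.append("cyclic" if goal in seen else "acyclic")
    return labels


# Methylcyclopropane skeleton: ring atoms 0, 1, 2 and a methyl carbon 3.
skeleton = [(0, 1), (1, 2), (2, 0), (0, 3)]
print(classify_bonds(4, skeleton))  # ['cyclic', 'cyclic', 'cyclic', 'acyclic']
```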

The Notation File 

A master file of the unique notations is maintained as a part of the 
CAS Registry. As each new notation is added to the master file, the next 
available Registry Number is assigned to that notation. All subsequent 
entries of the same compound result, of course, in an identical notation 
that is compared with the master structure file to show that it is the 
same compound. When the notations match, the previously assigned Registry 
Number is retrieved. 
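
The assign-or-retrieve behavior of the master file can be sketched with an in-memory dictionary. The class name, the string notations, and the starting number are all hypothetical; the actual file is a sequential tape file, not a dictionary.

```python
class NotationFile:
    """Toy master file mapping unique notations to Registry Numbers."""

    def __init__(self, first_number=1):
        self._numbers = {}
        self._next = first_number

    def register(self, notation):
        """Return (registry_number, is_new) for a unique notation."""
        if notation in self._numbers:
            return self._numbers[notation], False  # previously registered
        number = self._next
        self._numbers[notation] = number
        self._next += 1
        return number, True


master = NotationFile()
print(master.register("notation-for-compound-A"))  # (1, True)
print(master.register("notation-for-compound-B"))  # (2, True)
print(master.register("notation-for-compound-A"))  # (1, False)
```

The `is_new` flag is what makes new-compound alerting possible as a by-product of registration.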

The form of notation stored on the master file is more compact than 
is the connection table used for input. Redundant information, which is 




useful for editing the keyboard records, is not necessary after the table 
has been verified and accepted for generation of the unique notation. 

The redundancy then is removed before the notation is stored in the 
master file. The organization of the file itself permits an even more 
compact record. The lexicographic ordering of the file brings together 
groups of compounds with similar structural features. All acyclic struc- 
tures, for example, are placed before any cyclic structures. Further, 
all compounds are grouped together which have the same graph. Thus, the 
hierarchy of notation order permits elimination of the duplicate portions 
of adjacent records to decrease storage space requirements. Compaction 
of this type decreases the file to approximately 64% of the size it would
be if the entire notation were stored. In terms of file size, more than
100,000 compounds can be stored on each reel of magnetic tape. Thus, a
single reel of tape will suffice for small to medium size files.
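
Eliminating the duplicate leading portion of adjacent records in a sorted file is commonly called front compression. The sketch below shows the idea on placeholder strings; the strings are not CAS notations, and the (shared length, remainder) record format is an assumption.

```python
def front_compress(sorted_records):
    """Store each record as (characters shared with the previous record, rest)."""
    compressed = []
    previous = ""
    for record in sorted_records:
        shared = 0
        limit = min(len(previous), len(record))
        while shared < limit and previous[shared] == record[shared]:
            shared += 1
        compressed.append((shared, record[shared:]))
        previous = record
    return compressed


def front_expand(compressed):
    """Rebuild the full records from the compressed form."""
    records = []
    previous = ""
    for shared, rest in compressed:
        record = previous[:shared] + rest
        records.append(record)
        previous = record
    return records


records = sorted(["CCCO", "CCCN", "CCO", "CCCCl"])
packed = front_compress(records)
print(packed)  # [(0, 'CCCCl'), (3, 'N'), (3, 'O'), (2, 'O')]
```

The saving depends entirely on the sort order bringing similar records together, which is why the lexicographic file organization matters.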

Automatically Generated Cross References 

One significant advantage in a file organized into groups of closely 
related structures is the ability to automatically generate cross references. 
All registered stereoisomers occur adjacent to the non-stereospecific structure
on the file. Isotopically labeled compounds are grouped with the
unlabeled isomer. Amine salts, like the hydrohalides, are cited under the
structure of the amine, while all metallic salts of an organic acid are
grouped with the free acid. File hierarchy and the organization which it
permits have helped in developing the cross reference feature of the faceted
numbers which appear in Chemical-Biological Activities. Within any series
of related compounds the faceted numbers for each isomer or salt include
the Registry Number of the parent. Faceted numbers do not replace the
Registry Numbers, but serve as cross references between acids and their 
salts, bases and their salts, groups of quaternary compounds that have 
a common cation, and stereochemical isomers. 



Applications for Substructure Search 

The Registry structure file provides not only the means of uniquely 
identifying structures, but also the data base for substructure search. 

The notation record permits the direct application of many substructure 
searching techniques. Since the notation explicitly includes the elements,
their connections and the bond values and bond types, no expansion of 
the notation is required to effect an atom-by-atom search record. The 
Substructure Search System locates within the structure file or a sub- 
file all compounds that share particular substructures or structural frag- 
ments. 

The basic search system has been designed for maximum flexibility, in 
both the degree of specificity of the questions which can be posed to the 
system and the degree of specificity of the answer desired by the questioner.
For example, one questioner may desire exact answers to his substructure
search, while another user may want a set of answers which includes exact
answers plus a selected number of closely related answers. This second
option allows "browsability" which is a creative stimulus to the researcher.
Of course, economics, too, play a part in determining the most desirable
level of results.

For greater flexibility, the search system is designed to permit 
several levels of retrieval specificity. A fragmentation search technique 
analogous to the widely used manual and punched card systems provides an 
economical retrieval tool. Chemical fragment screens, many of which 
correspond to the traditional functional groups, are generated automati- 
cally by computer program from the structure notation record. These types 
of structural screens are shown on the slide, where the traditional chemical
functional groups include such groups as carbonyl, nitrile, amide,
sulfonic acids, and derivatives. Some generic structural features are 
included for such classes as hydrocarbons, halogens, and metals. 

The screening of the structural records via the fragmentation search
is a very rapid, essentially tape speed search. The search program operates 
on all Boolean logic parameters (and/or/not) which permits considerable 
flexibility in search strategy. All compounds obtained as answers satisfy 
the search parameters, and under no circumstances are answers excluded. 

That is, there is 100 percent recall by this search technique. As in any 
fragmentation code, however, the interconnections between the structural 
fragments may not match the requested substructure, so that the set of 
answers probably includes not only all exact answers but also some closely 
related compounds. The degree of relevance depends to a large extent on 
the nature and specificity of the substructure query. Experience with 
files of 20,000 to 55,000 compounds indicates a 75 to 80 percent relevancy
for most types of questions, which is usually quite satisfactory for most
users of this high speed, relatively economical search. However, the
number of answers obtained determines in many cases whether or not this
degree of relevance is satisfactory: 80 percent of 10 answers means that
only 2 of the retrieved compounds are irrelevant, but 80 percent of 1000
or 10,000 answers is quite another matter.
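
A fragment screen with and/or/not logic, together with the relevance arithmetic from the paragraph above, can be sketched as follows. The fragment names, the Registry Numbers, and the function names are invented for the illustration.

```python
def screen(fragment_file, require=(), any_of=(), exclude=()):
    """Return the Registry Numbers whose fragment sets satisfy the query."""
    hits = []
    for registry_number, fragments in fragment_file.items():
        if not all(f in fragments for f in require):            # "and"
            continue
        if any_of and not any(f in fragments for f in any_of):  # "or"
            continue
        if any(f in fragments for f in exclude):                # "not"
            continue
        hits.append(registry_number)
    return sorted(hits)


def irrelevant_answers(answers, relevance_percent):
    """Number of retrieved compounds that are not true answers."""
    return answers * (100 - relevance_percent) // 100


# Hypothetical screen file: Registry Number -> set of fragment screens.
FRAGMENT_FILE = {
    101: {"carbonyl", "halogen"},
    102: {"carbonyl", "amide"},
    103: {"nitrile"},
    104: {"carbonyl", "nitrile", "halogen"},
}

print(screen(FRAGMENT_FILE, require=["carbonyl"], exclude=["amide"]))  # [101, 104]
print([irrelevant_answers(n, 80) for n in (10, 1000, 10000)])  # [2, 200, 2000]
```

The second line of output makes the point of the paragraph concrete: at the same 80 percent relevance, the reviewer discards 2 compounds in one case and 2000 in another.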

For searches that do require 100 percent relevance as well as 100 
percent recall, a second search technique is available that compares each
atom and bond of the substructure with the connection table of the structures
on the master Registry File. This iterative atom-by-atom matching
technique can be used independently of fragmentation search, but it is
more economical to use the rapid screening technique first to select compounds
that are potential answers. The specific answers can then be
identified in this subset by the slower but exacting iterative matching process.
The hierarchy of the structure file organization also facilitates
the retrieval since closely related compounds that satisfy the search re- 
quest can be retrieved without interrogating each individual notation. 
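
Atom-by-atom matching can be sketched as a backtracking search that maps each query atom onto a distinct atom of the candidate structure and checks every query bond. This is a generic illustration under assumed data formats (element lists and bond dictionaries), not the CAS search program; in practice it would be run only on the compounds that pass the fragment screen.

```python
def match_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return True if the query substructure occurs in the target.

    atoms: list of element symbols.
    bonds: dict mapping frozenset({i, j}) -> bond order.
    """
    def consistent(q, t, mapping):
        if query_atoms[q] != target_atoms[t]:
            return False
        # Every query bond to an already-mapped atom must exist in the target.
        for q_prev, t_prev in mapping.items():
            order = query_bonds.get(frozenset((q, q_prev)))
            if order is not None and target_bonds.get(frozenset((t, t_prev))) != order:
                return False
        return True

    def extend(mapping):
        q = len(mapping)  # query atoms are mapped in index order
        if q == len(query_atoms):
            return True
        for t in range(len(target_atoms)):
            if t not in mapping.values() and consistent(q, t, mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack
        return False

    return extend({})


# Query: a carbonyl group (C=O).
carbonyl_atoms = ["C", "O"]
carbonyl_bonds = {frozenset((0, 1)): 2}

acetone_atoms = ["C", "C", "C", "O"]
acetone_bonds = {frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((1, 3)): 2}

ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = {frozenset((0, 1)): 1, frozenset((1, 2)): 1}

print(match_substructure(carbonyl_atoms, carbonyl_bonds, acetone_atoms, acetone_bonds))  # True
print(match_substructure(carbonyl_atoms, carbonyl_bonds, ethanol_atoms, ethanol_bonds))  # False
```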

Questions may be posed to the system for either the fragmentation or
atom-by-atom search levels in terms of "and," "or," and "not" logic.

"And" logic requires the presence of an atom or group of atoms in the 
answer. "Not" logic specifies that an atom or group of atoms must not be 
present in the answer. "Or" presents alternatives which may occur within 
a substructure or alternative substructural units that may occur in the 
answers. The fourth possibility, "don't care", allows atoms and bonds
within the substructure to be left unspecified. The same logic operators
can also be applied to two or more substructures within the same query.

For example, it can be specified that two substructures must not co-occur 
in the same molecule. 


Uses of the Substructure Search System 

The potential uses of this substructure search system are quite 
varied. 

1. It provides the mechanism for automatically generating fragmenta- 
tion codes for any one of the several fragmentation search systems presently 
in use. Such a list of fragments can also be considered a profile to be 
used in automatically updating a user's structure file or fragmentation
file. The results of the automatically generated fragments can be re- 
tained for computer searches or can be printed as index-type listings for 
manual searches. This feature has been used to assign structural class 
terms, or MeSH terms, used in the National Library of Medicine's MEDLARS 
System, where the search results are the nomenclature terms corresponding 
to structural fragments of the compound. Highly specialized or customized
fragmentation codes can be automatically added to the high speed screen
level search record by running a one-time iterative search for the structural
fragments. The desired fragments are simply coded as routine search
requests and the answers permanently recorded as a fragment for use in
subsequent searches. And changing the fragmentation codes is as simple as
redefining the substructure search and performing a single search.

Substructure Search also provides the ability to select from a large
file of compounds a smaller set containing specific fragments. Thus,
all compounds having significant structural features pertinent to an
organization's research interest could be identified and selected as a
subfile. Such a list of fragments can also be considered a profile to
be used in automatically updating a user's structure file or fragmentation
file.



2. Since new compounds can be identified during the registration 
process, this system can provide an alerting service for new compounds 
containing specified substructures of interest to a user. New molecular 
ring systems imbedded within compounds have been identified in this manner 
to identify ring systems for supplements to The Ring Index.

3. The search system provides customized searches for compounds which
may have similar physical or biological properties because they have similar 
structural characteristics. Registry Numbers for compounds identified by 
Substructure Search can be used as parameters for the computer-based text 
searching in conjunction with conceptual terms and authors' names in the 
computer tape or printed versions of CBAC and POST.

Custom searches are by no means limited to the CAS computer — they can 
be conducted by other institutions or organizations using their own equip- 
ment and files, and probably in conjunction with other internal data files. 

4. The search system provides the means for identifying and retriev- 
ing generic classes of compounds which can be combined with text material 
in handbooks or indexes for use as desk references. CAS is now developing
a computer-driven photocomposition system that will rapidly format intermixed
text material and structural diagrams with high graphic arts quality.

Applications of the Registry System 

The information contained in the Registry files can be put to many uses. 
For example, the registration process itself identifies new compounds as
it assigns the Registry Number; that is, it establishes whether or not the
compound has been previously registered and therefore already has a Registry
Number. The full benefit of the system for new compound alerting cannot
be derived until the records include all chemical compounds that have
been reported in the literature. Only with a complete record, for example,
will it be possible to determine with certainty that a given compound is or
is not new to chemistry. However, the present files do provide a valuable
tool for determining whether or not a compound has appeared in the literature
since 1965, with manual searches of the CA indexes used in the conventional
manner to scan the earlier literature.

Nomenclature 

The Nomenclature File extends the uses of the familiar printed CA 
indexes. This File includes systematic names, as illustrated by the 
Chemical Abstracts index name, as well as trade names, generic names, and 
established laboratory numbers. Each name is linked to the appropriate 
structure by the Registry Number. These synonyms for a compound provide 
a greatly expanded thesaurus of terms for use in searching such publications
as CA, CT, CBAC and POST. Indexes have been produced directly from
the computer files through the use of computer-programmed rules for
alphabetizing the compound nomenclature. A combination of names with
molecular formulas provides molecular formula indexes. Another example
of a molecular formula index prepared from the CAS files is an index produced
for the NLM. The order of the element symbols within each molecular
formula is the NOPS order (nitrogen, oxygen, phosphorus, sulfur, followed by
other elements in alphabetical order) rather than the modified Hill form
used in CA. This NOPS sequence is generated by computer programmed
instructions from the Hill form which is used within the CAS files; thus,
no professional or clerical time is necessary to accomplish the translation
on a routine basis.
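
The reordering from Hill form to NOPS form can be sketched in a few lines. The text does not say where carbon and hydrogen fall in the NOPS sequence; the sketch assumes they stay in front, as in the Hill form, and that assumption should be checked against an actual NOPS index. The function names are hypothetical.

```python
import re


def parse_formula(formula):
    """Parse a simple formula string, e.g. 'C2H6O' -> {'C': 2, 'H': 6, 'O': 1}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def nops_formula(hill_formula):
    """Reorder a Hill-form formula into the assumed NOPS sequence."""
    counts = parse_formula(hill_formula)
    # Assumption: C and H lead, then N, O, P, S, then the rest alphabetically.
    leading = ["C", "H", "N", "O", "P", "S"]
    order = [element for element in leading if element in counts]
    order += sorted(element for element in counts if element not in leading)
    return "".join(
        element + (str(counts[element]) if counts[element] > 1 else "")
        for element in order
    )


print(nops_formula("C6H4Cl2NO2S"))  # C6H4NO2SCl2
print(nops_formula("C2H5BrO"))      # C2H5OBr
```

Because the stored form is parsed into element counts first, any other ordering convention can be produced from the same record by changing only the `leading` list.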



The versatility of the computer in reorganizing and reformatting
the stored information in a single form is quite apparent here. Formerly,
the production of multiple indexes in different sequences required the
preparation of separate card files of data, manually sequenced into the
different orders. Now this task can be performed easily and quickly by
computer from a single input record.

The Nomenclature File is also used as a means of entry into the system. 



For compounds which appear frequently in the literature, it is often con- 
venient to add new data to the files by matching names against the Nomen- 
clature File. For small collections of names, this matching process can 
be done manually from alphabetically arranged indexes. For processing
large collections of names, CAS has developed computer programs for matching 
names, retrieving the Registry Numbers and CA index names, and bibliographic
references to the files. Names which are ambiguous (that is, names identified
with two or more different compounds) are flagged within the files.
A match against one of these ambiguous names provides multiple retrieval
for review by the chemist.

Rules have also been programmed for editing nomenclature, although
these editing checks are by no means as complete as are those for the
structure connection tables. The present program checks primarily for
consistency of format, capitalization, and italicizations, and automatically
corrects many of the errors of this type that are identified.

The nomenclature filing system also provides diagnostic messages for 
potential problems to be solved by nomenclature specialists. Work is 
continuing on extending the nomenclature editing features of the computer 
system. The nomenclature translation programs themselves provide a power- 
ful editing function since names which can not be successfully translated 
into connection tables (i.e., contain errors) are rejected by the trans- 
lation program with appropriate diagnostic messages for review. One of 
the first applications of the Nomenclature Translation programs this year 
will be the generation of the connection tables from the names already 
stored in the Nomenclature File and comparison of the generated table with 
the connection table already on the structure file for that compound. 

The third file of data contained in the Registry System is the file 
of bibliographic citations. One of the more obvious uses of this informa- 
tion is a bibliography for any specific compound or for groups of compounds 
such as would be identified through the Substructure Search System. This
type of retrieval was used to obtain the references to the substructure
search answers obtained in the demonstration of that system at the ACS
meeting in New York in September of 1966. A less obvious use of this
file is the retrieval of Registry Numbers, then nomenclature, based on a
given reference, for example, the reference to a book such as The Merck
Index. Selection criteria based on some 30 reference works have been
used to organize data into various index formats for the Food and Drug
Administration and the National Library of Medicine. Another use of the
bibliographic data, one which is not yet fully operational, is the selection
of compound-related data for publication of indexes in the primary
journals.

Registry Numbers have been appearing in the Journal of Organic 
Chemistry since March 1967. The publication of the Registry Numbers in
the primary publication is the first step in providing molecular formula
and nomenclature indexes to the compounds within the primary publication.
The link between the two systems (the parent ACS organization and its
secondary publishing arm, CAS) is the bibliographic citation to the
original article and the corresponding citation to the abstract which is
the citation in the Registry files. Much of the work necessary to provide 
all the necessary links on a time scale consistent with the publication of 
the article in the Journal has already been completed; the conversion 
should be possible when CAS has fully computerized its handling of the 
bibliographic citation data that appears in Chemical Abstracts.

Summary

This paper presents, in brief, an overview of the CAS Compound Regis- 
try System, its component files, and the many uses to which this new 
computer-based facility can be put. Perhaps the most important aspect of 
the Chemical Compound Registry is the unified nature of the data bank. 

Each of the different data elements has been identified and flagged; this
permits the selection and reorganization of any of the material into a
vast number of specially designed formats, either in printed or machine-readable
forms, or both. In addition, any given subset can include data
from other portions of the total CAS data bank. Pilot services which
utilize these features are already available and in use. The full potential,
however, will be realized only as imaginative chemists and chemical engineers
begin to identify new and better information needs within their own
programs.

APPENDIX C



A DESCRIPTION OF THE 
CAS CHEMICAL REGISTRY SYSTEM 



A. Introduction 

B. Data Flow for the Registry System

C. Registry Form

D. Registry Numbers 



Extracted from the CAS "Registry Division Operations 
Manual" Copyright 1968 by the American Chemical Society.

Registry System Operations



A. Introduction

Since 1958, CAS has been working toward establishing a computer-based
system for handling chemical information, such as physical properties,
chemical reactivities, biochemical activities, and applications.

Because of the importance of chemical structures and the need to 
interrelate structures and corresponding chemical, physical, and biologi- 
cal data, a subsystem called the CAS Chemical Compound Registry System 
is the first step in the operation of an over-all computer-based service.

The Chemical Compound Registry System is a machine (computer) record 
of chemical structures, and chemical constitution expressed as molecular 
formulas, names, and literature references. The process of registration
includes the assignment of a unique number to each different chemical
structure. This number is called the Registry Number. The Registry Number
is a computer-generated nine-digit number, which is not an "intelligent"
number. That is, the number does not convey any information about the
structure with which it is associated. The units position is a "check
digit" generated by the computer. This is a safety factor to reduce errors 
from miscopied numbers; it is a means for allowing the computer to reject 
a Registry Number which is wrong. The Registry Number is the link between 
the structure of the compound and all other information about the compound 
in the additional records which will be established in the future, con- 
cerning physical and chemical characteristics, biological and medical 
properties, and practical industrial uses. It is intended that eventually 
the computer system will have records of every chemical substance and all 
the useful published material bearing on each substance.
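
The text does not state how the check digit is computed. The sketch below uses the rule documented for present-day CAS Registry Numbers: each digit of the body is multiplied by its position counted from the right, the products are summed, and the check digit is the sum modulo 10. Whether the nine-digit numbers of 1968 used the same rule is an assumption, and the function names are hypothetical.

```python
def check_digit(body_digits):
    """Compute the check digit for the digits that precede it."""
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid(registry_number):
    """Accept a number, with or without hyphens, only if its check digit agrees."""
    digits = registry_number.replace("-", "")
    if not digits.isdigit() or len(digits) < 2:
        return False
    return check_digit(digits[:-1]) == int(digits[-1])


print(is_valid("7732-18-5"))  # True  (water, in the modern hyphenated format)
print(is_valid("7732-18-6"))  # False (last digit miscopied)
print(is_valid("7723-18-5"))  # False (two digits transposed)
```

Because the weights differ by position, the check catches both single miscopied digits and most transpositions of adjacent digits.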



Since it is intended that the machine record will eventually include
all compounds, it is important that registration be as specific as possible;
that is, each structure must have a separate, different Registry Number.
This idea is followed through in that an acid and each of its salts have
separate Registry Numbers - acetic acid, sodium acetate, aluminum acetate,
for example, all have different Registry Numbers. Similarly, with bases -
aniline, aniline hydrochloride, and aniline hydrobromide have different
Registry Numbers. Optically active compounds and racemic mixtures such as
d-mandelic acid, l-mandelic acid, and dl-mandelic acid each have unique
Registry Numbers; so also does mandelic acid with no stereochemistry
specified. Each specified stereoisomer of a given set, such as the
1-hydroxy-2-chlorodecalins, also receives a separate Registry Number.

Labeled compounds, radioactively or otherwise, receive Registry Numbers 
different from those of the normal compounds; for example, toluene, toluene 
labeled with carbon-14 in the meta position, and toluene labeled with 
tritium in the para position each receive a different Registry Number. 
Biochemical literature often furnishes information about an ion, such as 
lactate, acetate, or pyruvate, and in such a case the ion receives a 
Registry Number different from that of the parent acid or a specific salt. 
Geometrical isomers such as cis-stilbene and trans-stilbene have different 
Registry Numbers; so also does stilbene with no stereochemistry specified. 

B. Data Flow for the Registry System 



Registry operational flow is illustrated in a simplified diagram as 
shown in Figure 1. (Numbered points are described below.) 

The structures which are recorded by computer processes originate as 
shown in Figure 1 (top center): (1) in the preparation of structures for 
compounds selected for index entries in the CAS Subject and Formula Indexes 
and for compounds appearing in digests in the journals Chemical-Biological 
Activities (CBAC) and Polymer Science and Technology - Journal and Patent 
Sections (POST-J and POST-P); (2) from other sources, such as CAS files, 
non-CAS files, and reference works. In some cases the structures are already 
present (as in the Merck Index) and in other cases the structures must be 
supplied, usually by the Formula Indexing Department. 

It should be noted that, in the routine work flow involved in CA 
Indexing and Registry Divisions, the structures prepared in the Formula 
Indexing Department pass into the Registry Division for the purpose of 
registration and then proceed to the Subject Indexing Department so that 
the indexer can supply an index name for the compounds. 

1. Information enters the Registry System on specially designed Registry 
Forms, one for each compound to be registered. As it leaves the 
"structure drawing operation", the form contains a preprinted temporary 
identification (TID) number and the following information supplied by 
the originating chemist: the structural formula to be registered, the 
molecular formula (often abbreviated "molform"), the nomenclature* 
(not routinely, but in some cases), the bibliographic citations to 
information about the compound, and a term or descriptor that describes 
the stereochemical, labeling, and, in some cases, other aspects of the 
structure. (This descriptor, presented later in detail, is one of the 
conventional stereochemical descriptors used in the chemical literature, 
or a device to differentiate further between structures.) 

* Nomenclature includes systematic and nonsystematic names, acronyms, 
laboratory numbers, etc. 

2. Structures are checked and classified by the originating chemists as 
to the necessary mode of registration. 

3. The Registry Forms are sorted and batched. Most compounds are machine 
registered, but manual registration (cf. point 15) is used for: 
(1) compounds with more than 255 non-hydrogen atoms, and (2) compounds 
for which structuring conventions have not yet been established for 
the Registry System. 

4. The connection tables or structures, molforms, source codes, and 
text descriptors are keyboarded and processed according to point 7 
below. 



5. The Registry Forms are placed in the Inwork File. 



6. Registry Forms remain in the Inwork File until registration is complete 
for the compound. They are then added to the Master File of Registry 
Forms (cf. points 9 and 10). 

7. The information keyboarded in point 4 above is entered into the 
computer, where the compound is either (1) registered (cf. point 8); 
(2) added to a listing of compounds that have the same "two-dimensional" 
structure, but differ in textual descriptors (cf. point 11); or (3) 
rejected because of keyboarding or connection-tabling errors (cf. point 13). 

8. As compounds are registered, records are kept of compounds new to the 
system and compounds that have been registered previously. Gummed 
labels containing the assigned Registry Number are printed for all 
compounds. Previously registered compounds are identified by a series 
of asterisks which precede the TID on the label. 

9. The gummed labels are placed on the proper Registry Forms in the 
Inwork File. 

10. Registry Forms are filed, in Registry Number order, in the Master 
File of Registry Forms. 

11. If a connection table for a structure being processed is the same as 
one or more previously registered structures, but has a different 
textual descriptor, then all structures in question are listed for 
resolution by a chemist. 

12. With the aid of Registry Forms pulled from the Inwork and Manual 
Files, the chemist resolves the problems, determining which structures 
are identical and which differ in their stereochemistry. Resolved 
problems are re-entered into the system at point 7. 

13. Compounds rejected for keyboarding or connection-tabling errors are 
reviewed clerically, and errors that are detected are corrected and 
re-entered into the flow at point A in the diagram. Errors that 
cannot be resolved clerically are sent to a chemist for review. 

14. The reviewing chemist either corrects the error and re-enters the 
Registry Form into the system at points A or B, or, for problems he 
cannot resolve on the basis of the information available to him on 
the Registry Form, returns the Form to its source (structure drawing 
operation) for resolution of the problem. 



15. Compounds that cannot be machine registered are registered manually 
by a chemist. For this purpose, several files of manually registered 
compounds are maintained, in molform order, in name (alphabetical) 
order, or in Registry Number order. The chemist either retrieves from 
the file the Registry Number previously assigned to the compound or 
else assigns a new number if the compound has not been registered. 

The TID and Registry Number of the manually registered compounds are 
keypunched and added to the machine files. 



16. The TID, molform, nomenclature, and bibliographic citations are 
input into the Data Sheet system. They are sent to data input routines 
for the Bibliography System. 

17. Name Match Registration process - A system whereby the nomenclature 
used to identify substances to be registered is keyboarded and automatically 
matched against the All Name File to recover previously assigned Registry 
Numbers and Index Names and to permit updating of the Bibliography File. 

C. Registry Form 

A form called the Registry Form is used for the structures, a separate 
form for each structure. This form has undergone several revisions; the 
latest is shown as Figure 2. The information given on this form, if the 
source is the current CA volume, is for the use of the Subject Indexer as 
well as Registry Division personnel. 

A description of the form follows: 

1. ID (Identification) - Space for control information relative to 
generation of the sheet. 

2. Ref. Source (Reference Source) - The source of the information on 
the sheet. 

3. Vol. (Volume) - The CA volume number, or alternatively the names 
or abbreviations of whatever the source is for the structure. 

4. Col. (Column) - The CA page (column) number and fraction where 
reference is made to the compound in the CA abstract, or alternatively, the 
page reference used in another source. 

5. Abt. (Abstract) - The number assigned to the CA abstract. The 
abstracts in a column are numbered in sequence. 

6. Compd. No. (Compound Number) - The number assigned by the Formula 
Indexer to the compound in the particular abstract, original paper, or unit 
of work. Each compound selected for indexing within a given original 
publication is given a number in sequence from one to the final number of 
compounds in the publication. 

7. Sheet No. (Sheet Number) - The consecutive working record number 
of the chemist preparing the structure. 



8. Chem. (Chemist) - (a) The person who prepares the structure and 
calculates the molecular formula; (b) the person who provides the CA index 
name for the compound. 

9. Codes - The mode code (a letter), given first, which indicates the 
input method (R = connection table; no other codes have been assigned yet), 
and the source code (a two-digit number) which indicates the source (for 
example, 16 is the code for CA Volume 63). R16 indicates that the structure 
is from CA 63 and has been recorded by connection table. 

10. Reg. No. (Registry Number) - Assigned out of a handbook (for 
example, the Name Index) or as a result of mechanized recording. 

11. Added Ref. File Codes (Added Reference File Codes) - Multiple 
references for a compound appearing in source documents following the first 
reference (point 2). 

12. Names - Name(s) or designations (such as a Roman Numeral) of the 
compound as provided by the author of the original paper. 

13. TID (Temporary Identification Number) - A mechanically generated 
number, preprinted on the Registry Form, used to identify a compound until 
the Registry Number is assigned. 

14. TID - Space for typing the TID from 13. 

15. Prop. (Properties) - Physical properties noted in the abstract or 
original, for the aid of the Subject Indexer. 

16. TID - Identical to the TID of point 13, but appearing in the area 
to be used for possible optical scan input of the structure. 

17. R (Ring) - If computer ring system analysis is required for indexing 
use, the R is crossed out. 

18. Text-desc. (Textual Descriptor) - Notation of stereoisomerism, 
isotopic labeling, or other information to be added in an appendix to the 
connection table. 

19. IMF (Index Molecular Formula) - The molecular formula used by the 
Indexer; this space is not filled in when the IMF and MF (see point 20) are 
the same. 

20. MF (Molecular Formula) - The molecular formula, as used in the 
Registry System, for the compound. 

21. Structure Area (not actually labeled - the large blank area in the 
center of the Registry Form) - The area in which the structure is drawn or 
reproduced. 

22. Notes - Observations of the person who prepared the structure. 
They pertain to other information on the Registry Form. 

23. PIN, U (Preferred Index Name) (Uninverted) - The preferred Index 
Name in the inverted form, as it appears in the CA Subject Indexes. If the 
name is in the uninverted form, notation is made in the U-box. This space 
is often vacant, since the name may not be available to the person who 
prepares the structure. 

24. (Added Index Name) - An additional CA name which appears in 
the CA Subject Index. For example, some complex esters are given both 
acid-based and alcohol-based names, although the acid-based name is preferred. 
This space is often vacant, since the name is often not available to the 
person who prepares the structure. 

D. Registry Numbers 

The Chemical Compound Registry Number is a nine-digit number whose 
units position is a check digit generated by the computer. The number has 
no established pattern in that no part of the number corresponds with any 
feature of the structure to which it is assigned. 

The first eight digits of the Registry Number are a number serially 
assigned by computer at the time of registration. (The eight digits 
presently include from one to three zeros for the hundred-thousand, million, 
and ten-million positions.) The ninth digit (the units position) is a 
check digit used to reduce transcription errors. The check digit is com- 
puted on the basis of the first eight in the following manner. The position 
of each digit of the eight-digit number is numbered from right to left 
starting with the number one, without skipping. Each digit of the number 
is multiplied by its position number and the results are added. The digit 
in the units position of the final result is the check digit and becomes 
the units digit of the Registry Number. 

For example, assume the number serially assigned is 00095216; numbering 
the positions gives: 

87654321 

00095216 

Multiplying each digit by its position and adding the results yields, from 
right to left: 

(6 X 1) + (1 X 2) + (2 X 3) + (5 X 4) + (9 X 5) + (0 X 6) + (0 X 7) + (0 X 8) 

= 6 + 2 + 6 + 20 + 45 + 0 + 0 + 0 

= 79 

The units digit of the answer is 9; this is the check digit of 
the number 00095216. Adding the check digit to the right end of the number 
gives 000952169, the Registry Number. The zeros in the leftmost positions 
are omitted for printing and publication purposes. 
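
The check-digit rule described above can be illustrated with a short sketch. It is not the CAS checking routine itself; the function names (`check_digit`, `registry_number`, `is_valid`) are hypothetical and chosen only for this illustration, and alphabetical prefixes are ignored.

```python
def check_digit(serial: int) -> int:
    """Check digit for a serially assigned number (the first eight digits).

    Hypothetical helper: each digit is multiplied by its position, counted
    from the right starting at 1, and the units digit of the sum is returned.
    """
    digits = str(serial)  # leading zeros would contribute nothing to the sum
    total = sum(int(d) * pos for pos, d in enumerate(reversed(digits), start=1))
    return total % 10


def registry_number(serial: int) -> int:
    """Append the check digit in the units position."""
    return serial * 10 + check_digit(serial)


def is_valid(number: int) -> bool:
    """Recompute the check digit and compare it with the units position."""
    serial, check = divmod(number, 10)
    return check_digit(serial) == check


print(registry_number(95216))   # worked example from the text: 952169
print(is_valid(952169))         # True
print(is_valid(952196))         # False: last two digits transposed
```

Because the weights differ from position to position, a miscopied digit or a transposition of neighboring digits usually changes the weighted sum and is caught when the number is validated.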

All of the computer programs at Chemical Abstracts Service which 
handle the Registry Number include the checking routine. Thus, for each 
data manipulation keyed to Registry Numbers, the Registry Number is 
validated. The chances of misidentifying one Registry Number for another 
are substantially reduced and much proofreading is avoided. 

Two special series of Registry Numbers have been established, one for 
mixtures, and the other for polymers. The Mixture Registry Numbers are 
characterized by the prefix MX attached to a seven-digit Registry Number 
drawn from the eight million series; for example, MX8000008. The Polymer 
Registry Numbers are composed of the prefix PM attached to a seven-digit 
Registry Number drawn from the nine million series; for example, PM9000004. 
The machine validates the presence of the particular prefix involved. 

In order to make the Registry Numbers easily identifiable, Chemical 
Abstracts Service has established a standard format. The format is 
(D = digit): 

D(1-6)-DD-D 

From left to right, this consists of a group of one to six digits 
and an alphabetical prefix if present, a dash, a group of two digits, a 
dash, and the last digit. For example, 89-96-3; 3345-05-9; MX8264-01-9; 
PM9016-24-4. 
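
As an illustration of the standard format, the following sketch converts between a plain number and the dashed form. The function names and the regular expression are assumptions made for this example; padding numbers shorter than four digits with zeros is likewise an assumption, since the text does not say how such numbers are printed.

```python
import re

# Hypothetical pattern: optional MX or PM prefix, 1-6 digits, 2 digits, 1 digit.
_FORMAT = re.compile(r"^(MX|PM)?(\d{1,6})-(\d{2})-(\d)$")


def format_registry_number(number: int, prefix: str = "") -> str:
    """Render a Registry Number with leading zeros dropped and dashes added."""
    digits = str(number).zfill(4)  # assumption: pad so all three groups exist
    return f"{prefix}{digits[:-3]}-{digits[-3:-1]}-{digits[-1]}"


def parse_registry_number(text: str) -> tuple[str, int]:
    """Split a formatted Registry Number into (prefix, plain number)."""
    match = _FORMAT.match(text)
    if match is None:
        raise ValueError(f"not in the standard format: {text!r}")
    prefix, head, pair, check = match.groups()
    return prefix or "", int(head + pair + check)


print(format_registry_number(89963))          # 89-96-3
print(format_registry_number(9016244, "PM"))  # PM9016-24-4
print(parse_registry_number("3345-05-9"))     # ('', 3345059)
```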

I. INTRODUCTION 

As part of the development of a computer-based 
chemical information system at CAS, it has been necessary 
to devise techniques for the registration of drawings of 
chemical structures. A major purpose of the CAS regis- 
tration process is to determine whether a particular 
structure has already been stored in the system. The 
ability to make this determination makes it possible to 
utilize a computer to assign to every chemical structure 
a unique identifying label. This identifying label, referred 
to as a registry number, is the thread that ties together 
all information associated with a particular compound 
throughout the developing CAS computer system. It is 
because of this association, made possible by the regis- 
tration process, that CAS will be able to provide multiple- 
file correlative searches with assurance that all information 
on file for a particular compound has been located. 

II. THE REGISTRATION PROCESS 

The registration technique that has been selected by 
CAS requires computer generation of an alphanumeric 
description for each chemical structure that is unique 
for that structure. The machine technique has not yet 
been extended to all types of structural detail, but tech- 
niques and computer programs are complete for generating 
the unique description for the two-dimensional projection 
of fully known nonpolymeric chemical structures. The 
third dimension is presently handled by the addition of 
conventional stereochemical descriptors which are supplied 
by the chemist who prepares the structural diagram for 
input to the system.(1) 

In the coming months present basic machine techniques 
will be extended to handle partially unknown and poly- 
meric structures. Work is also progressing toward the 
inclusion of the third dimension directly in the graphic 
record so that in time the full steric picture will be in the 
form of a single detailed coherent record of each structure. 
The initial approach, however, permits CAS to provide an 
operable registry system that will accommodate all com- 
pounds without awaiting the utopia of a complete set of 
machine techniques and computer programs that will 
handle all chemical substances automatically. 

Once the unique descriptions for a set of input structures 
are obtained, the remainder of the registration process is 
simple and very fast. Since the description of each com- 
pound is in itself unique it is possible to organize both 
the input and registry files into a unique sequence. The 
use of this unique sequence reduces the actual registration 
process to a merging and updating of two serial files; 
therefore, it is the uniqueness of the machine representa- 
tion of a chemical compound that is the key to an effective, 
efficient, reliable registration system. 
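
The reduction of registration to a merge of two ordered files can be sketched as follows. This is a minimal illustration and not the CAS program: the function name, the use of plain strings as stand-ins for unique descriptions, and the in-memory lists standing in for serial files are all assumptions.

```python
def merge_register(registry, incoming, next_serial):
    """Merge a batch of unique descriptions into a sorted registry.

    registry    : list of (description, number) pairs sorted by description
    incoming    : descriptions to be registered, in any order
    next_serial : first unused serial number
    Returns the updated registry, the number found or assigned for each
    incoming description, and the next unused serial number.
    """
    merged, assigned = [], {}
    i = 0
    for desc in sorted(set(incoming)):      # put the input file in sequence
        # Copy registry entries that sort before the incoming description.
        while i < len(registry) and registry[i][0] < desc:
            merged.append(registry[i])
            i += 1
        if i < len(registry) and registry[i][0] == desc:
            assigned[desc] = registry[i][1]  # already on file
        else:
            merged.append((desc, next_serial))  # new to the system
            assigned[desc] = next_serial
            next_serial += 1
    merged.extend(registry[i:])              # remainder of the old file
    return merged, assigned, next_serial


registry = [("C-C", 1), ("C-O", 2)]
registry, assigned, next_serial = merge_register(
    registry, ["C-N", "C-O", "C=O"], 3)
print(assigned)   # {'C-N': 3, 'C-O': 2, 'C=O': 4}
```

Each file is read once from front to back, which is why the uniqueness and ordering of the descriptions matter more than any search technique.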



III. CHARACTERISTICS OF THE STRUCTURE DESCRIPTION 

The structure description employed in the CAS registration 
process is a uniquely ordered list of the node symbols 
of the structure (or graph) in which the value (atomic 
symbol) of each node and its attachment (bonding) to the 
other nodes of the total structure are described. Such a 
list and description is called a “connection table.” Since 
this paper is not concerned with structure input, the 
connection table which is described is that stored and 
manipulated by the computer. The form of the table 
which is used within the computer is not the most convenient 
form for input to the system; thus the input form is 
translated by the computer into the “compact connection 
table” developed by D. J. Gluck of du Pont.(2) In this form 
of the table, the nonhydrogen nodes of the structure are 
listed according to an exact set of rules. The application of 
these rules alone does not produce a unique table; it does, 
however, produce a partial ordering among the nodes of the 
structure. “Partial ordering” in this context means that 
at certain stages in the formation of the table certain nodes 
will receive preference for earlier listing in the table. This 
is important since the generation of the unique table is 
based in part on a process of successive partial orderings 
as will be seen later in this paper. 

(1) The system is now an operational element of the publication process of CBAC, 
a new CAS computer-based publication. 

(2) D. J. Gluck, J. Chem. Doc., 5, 43 (1965). 

After establishing a structure representation in the 
computer memory, the compact connection table is formed 
by first numbering the nonhydrogen nodes of the structure. 
This numbering proceeds from 1 using only the 
ordinal numbers. The numbers are assigned to the nodes 
of the structure according to the following rules: (1) a node 
is arbitrarily selected and assigned the locant, node 
number, 1; (2) the nodes attached to node 1 are numbered 
2, 3, etc. When all the nodes directly attached to node 1 
have been numbered, those which have not yet been 
numbered but which are attached to node 2 are numbered, 
and so on. This procedure is followed until all nodes have 
been numbered, or as in the instance of disconnected 
graphs such as represent ions, until the process leads to 
a point where not all nodes have been numbered, yet none 
of the unnumbered nodes is attached to a previously 
numbered node. Under such conditions another arbitrary 
choice is made among the unnumbered nodes for the next 
node to be numbered and the process of numbering is 
continued. 

Example I. Assume the structure 

    A-B-C 
      | 
      D 

For this structure the following table shows the numberings 
that result from application of the above rules. 

Locant    Possible node assignments 

  1       A A B B B B B B C C D D 
  2       B B A A C C D D B B B B 
  3       C D C D A D A C A D A C 
  4       D C D C D A C A D A C A 

For the structure in example I there are 24 possible 
numberings using the numbers 1-4; however, only 12 of 
the possible numberings comply with the rules cited above. 
This reduction is a characteristic of the numbering rules 
and becomes more significant as the size and complexity 
of the structure being treated increase. 
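
The counts quoted for Example I can be checked by enumerating every numbering that the rules allow. The sketch below is illustrative only; the function name `numberings` and the adjacency-dictionary representation are assumptions, and the graph for Example I (B attached to A, C, and D) is taken from the table of possible node assignments.

```python
from math import factorial


def numberings(adj):
    """Yield every node order permitted by the numbering rules.

    adj maps each node to the nodes attached to it. A node is numbered
    only when it is attached to the lowest-numbered node that still has
    unnumbered neighbors; an arbitrary choice is made at the start and
    whenever a disconnected part of the graph must be entered.
    """
    nodes = list(adj)

    def extend(order, i):
        if len(order) == len(nodes):
            yield tuple(order)
        elif i == len(order):
            # Nothing left to expand: arbitrary choice among unnumbered nodes.
            for n in nodes:
                if n not in order:
                    yield from extend(order + [n], i)
        else:
            candidates = [n for n in adj[order[i]] if n not in order]
            if candidates:
                for n in candidates:
                    yield from extend(order + [n], i)
            else:
                yield from extend(order, i + 1)

    yield from extend([], 0)


example_1 = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
allowed = list(numberings(example_1))
print(len(allowed), "of", factorial(len(example_1)))   # 12 of 24
```

The same enumeration applied to a simple six-membered ring gives 12 numberings out of 720, since after the first node and the order of its two neighbors are chosen the rest of the ring is forced.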

When the entire structure has been numbered according 
to the preceding rules the connection table is formed by 
recording the structural relationships in the five lists 
which compose the connection table, as follows: 

1. The “FROM ATTACHMENT” List —This list is 
composed of X fixed length ranks where X is equal to the 
number of nonhydrogen nodes in the structure. In this list 
the ith rank is used to describe not more than one attach- 
ment between the ith node and one other node of the 
structure. At the ith rank is recorded the rank number 
of the lowest numbered atom attached to the ith node. 
If, however, the rank number which would be recorded at 
the ith rank is numerically greater than i, the ith rank is 
left blank. 

Example II. Assume the following structure with the 
numbering shown. 

[Structure diagram with numbered nodes not reproduced.] 

For this structure with the numbering shown the 
“FROM” list is shown below. The rank numbers to the 
left in this and following examples are for the reader’s 
convenience and do not appear in the actual list. 

Rank no.    From attachments 

   1        Blank 
   2        001 
   3        001 
   4        001 
   5        004 
   6        004 



2. The “RING CLOSURE” List. — This list is composed 
of X fixed length ranks where X is equal to the number 
of cycles (rings) in the structure. Structures containing 
no cycles have no such list. 

After the formation of the FROM list there will remain 
one connection, not described in the FROM list, for each 
cycle in the structure. These additional bonds or ring 
closures are defined in the RING CLOSURE list as 
follows: 

(a) For each ring closure, record in a rank of the RING 
CLOSURE list the locants of the two atoms involved. 

(b) In each rank of the RING CLOSURE list, order 
the two locants so that the first is numerically less than 
the second. 

(c) Order the ranks of the RING CLOSURE list so 
that the locant pair of the first rank is numerically less 
than the second, which is less than the third, etc. Thus, 
002 007 < 003 005 < 003 006. 

Example III. Assume the following structure with the 
numbering shown. 

[Structure diagram with numbered nodes not reproduced.] 

For this structure with the numbering shown, the FROM 
list and the RING CLOSURE list are as follows: 

Rank no.    From attachment    Ring closure 

   1        Blank 
   2        001 
   3        001 
   4        001 
   5        002 
   6        002 
   7        003 
   8        004 
   9        006 
  10        007 
                               003 005 
                               008 009 

It should be noted at this point that the FROM list 
and the RING CLOSURE list are sufficient to completely 
describe the interconnections of the graph for the two- 
dimensional projection of the compound. 

3. The “NODE VALUE” List.— This list is composed of 
X fixed length ranks where X is equal to the number of 
nonhydrogen nodes of the structure. In this list the ith 
rank is used to describe the node value (atomic symbol) 
of the ith node (see example IV below). 



4. The “LINE VALUE” List.— This list is composed of 
X fixed length ranks where X is equal to the number of 
bonds in the structure between two nonhydrogen nodes. 
In this list the ith rank is used to describe the line value 
or bond for the attachment defined at the ith rank of the 
FROM list or RING CLOSURE list. For purposes of 
definition, the ranks of the RING CLOSURE list are 
assumed to be numbered consecutively after the FROM 
list. The bonds (i.e., line values) are described by assigned 
code. 

Example IV. Assume the following structure with the 
numbering shown. 

[Structure diagram with numbered nodes not reproduced.] 

Rank no.   From attachment   Ring closure   Node value   Line value 

   1       Blank                            N            Blank 
   2       001                              C            1 
   3       001                              C            1 
   4       002                              S            1 
   5       002                              N            2 
   6       003                              C            2 
                             005 006                     1 

5. The “MODIFICATIONS” List. This list is used to 
describe any other modifications of the nodes and lines 
as listed, such as the charges of ions, isotopic mass, and 
citation of unusual valence.(3) Such modifications are 
described by citing the type of modification in coded 
fashion, followed by the node number or line number 
being modified, followed by a description of the modification 
in coded form. Since the techniques for treating 
this list are merely an extension of the techniques applied 
to the previous four lists, discussion of the MODIFICATIONS 
list will be omitted from the remainder of this 
paper. 

(3) The editing routines include a check for normal valence. Thus, for example, 
a trisubstituted methyl free radical requires the specification that only three groups 
are directly bonded to the methyl carbon instead of the usual four. 

The compact connection table is at this stage an unambiguous 
description of the two-dimensional projection 
of a chemical structure drawing. Where necessary it is 
made unambiguous for three-dimensional structures by the 
addition of conventional stereochemical descriptors as 
previously mentioned. Thus, the table is at this stage 
an unambiguous but non-unique machine representation 
of the chemical structure. It is one of a family of unambiguous 
descriptions of the structure. The exact table selected 
for use in the CAS Registry System is a member of this 
family and is selected by further computer processing. 

In the following pages the techniques for selecting the 
unique table from among the family of unambiguous tables 
will be shown to be completely independent of the order 
of the nodes in the input table. Since the ordering process 
is independent of the order of the nodes in the input table, 
it follows that the unique table is also independent of both 
the orientation and the projection of the drawn structure. 
It also follows that the ordering process is independent 
of the means by which the drawn structure is converted 
to a machine representation, e.g., Army Chemical Typewriter,(4) 
optical scanning,(5) clerically generated connection 
table,(6) grid structure,(7,8) or linear notations,(9) so long as the 
resulting machine representation is in fact a representation 
of the structure in question. The points expressed in this 
paragraph are very important since the CAS Registry 
System is based on the premise that a unique structure will 
be stored once and only once, thus making the registry 
number a unique and unambiguous identification of a 
chemical substance. 
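
The first four lists of the compact connection table can be derived mechanically from a numbered structure. The sketch below is an illustration under stated assumptions, not the CAS or du Pont program: the function name and data layout are hypothetical, the MODIFICATIONS list is omitted, and the bonds for Example IV are reconstructed from the lists printed in that example because the structure diagram itself is not available.

```python
def compact_connection_table(atoms, bonds):
    """Build the FROM, RING CLOSURE, NODE VALUE, and LINE VALUE lists.

    atoms : dict mapping locant (1..n) to atomic symbol
    bonds : dict mapping a pair of locants to the coded line value
    Blank ranks are represented by None.
    """
    n = len(atoms)
    neighbors = {i: set() for i in atoms}
    value = {}
    for (a, b), code in bonds.items():
        neighbors[a].add(b)
        neighbors[b].add(a)
        value[frozenset((a, b))] = code

    from_list, used = [], set()
    for i in range(1, n + 1):
        lowest = min(neighbors[i], default=None)
        if lowest is not None and lowest < i:
            from_list.append(lowest)          # lowest-numbered attachment
            used.add(frozenset((i, lowest)))
        else:
            from_list.append(None)            # rank left blank

    # Every bond not described by the FROM list is a ring closure.
    ring_closures = sorted(tuple(sorted(e)) for e in value if e not in used)

    node_values = [atoms[i] for i in range(1, n + 1)]
    line_values = [None if f is None else value[frozenset((i, f))]
                   for i, f in enumerate(from_list, start=1)]
    line_values += [value[frozenset(pair)] for pair in ring_closures]
    return from_list, ring_closures, node_values, line_values


# Example IV, with bonds inferred from its printed lists (an assumption).
atoms = {1: "N", 2: "C", 3: "C", 4: "S", 5: "N", 6: "C"}
bonds = {(1, 2): 1, (1, 3): 1, (2, 4): 1, (2, 5): 2, (3, 6): 2, (5, 6): 1}
for name, lst in zip(("FROM", "RING CLOSURE", "NODE VALUE", "LINE VALUE"),
                     compact_connection_table(atoms, bonds)):
    print(name, lst)
```

Numbering the same structure differently would give a different but equally valid set of lists, which is the non-uniqueness that the following section sets out to remove.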

IV. THE GENERATION OF THE UNIQUE DESCRIPTION 

As has been stated, the unique table used in the CAS 
system is a member of a family or set of tables, all of which 
describe the same structure equally well. It is unimpor- 
tant, therefore, which member of the set is labeled unique 
so long as the same table is always selected for the same 
structure. Since it can be shown that the set is finite for 
any graph composed of a finite number of nodes, it is 
possible to select the unique table by generating all mem- 
bers of the set, lexicographically ordering the members of 
the set based on the characters involved in the description, 
and then selecting the first member of the resulting list as 
the unique table. This concept is a restatement of a technique 
proposed by C. N. Mooers for generating a unique 
cipher based on a process of making all possible “cuts” 
and comparing the resulting ciphers.(10,11) 

The generation of all possible tables of the type des- 
cribed would, in the case of large molecules, be prohibi- 
tively expensive. It is necessary, therefore, to devise 
techniques to limit the number of tables that must actually 
be generated to some invariant subset of tables which is 
small enough to make the process economically feasible. 
Having generated this invariant subset, the unique table 
is selected in exactly the same way as if the entire set 
had been generated. It does not necessarily follow that 
the same table would be selected from the subset as would 
be selected from the entire set, but that fact is not impor- 
tant so long as only a single subset is generated for a given 
compound regardless of the order of the nodes in the input 
table for that compound. 

In order to generate only an invariant subset of the 
possible set of tables, the computer program first employs 
the rules for numbering the structure and forming the table 
as described earlier. This procedure reduces what would 
be a factorial expression to a number which is almost 
always significantly smaller. For instance, in a simple 
six-membered ring there are 720 possible numberings; 
however, only 12 comply with the rules for numbering. 
Thus, the numbering rules have created an invariant 
subset. In addition to the rules of numbering, the computer 
program employs certain invariant properties of the 
graph to reduce further the size of the subset. These properties 
are the “connectivity value” of each node, the node 
value (atomic symbol), and the line value (bond). 

(4) A. Feldman, D. B. Holland, and D. P. Jacobus, J. Chem. Doc., 3, 187 (1963). 

(5) W. E. Cossum, M. E. Hardenbrook, and R. N. Wolfe, Proc. Am. Doc. Inst., 
269 (1964). 

(6) G. M. Dyson, W. E. Cossum, M. F. Lynch, and H. L. Morgan, Inform. Storage 
Retrieval, 1, 69 (1963). 

(7) P. Horowitz and E. M. Crane, “HECSAGON: A System for Computer Storage 
and Retrieval of Chemical Structure,” Eastman Kodak Co., Rochester 4, N. Y., 1961. 

(8) W. H. Waldo and M. DeBacker, “Proceedings of the International Conference 
on Scientific Information, Washington, D. C., Nov. 16-21,” Washington, D. C., 1959, 
pp. 711-730. 

(9) H. T. Bonnett, J. Chem. Doc., 3, 235 (1963). 

(10) C. N. Mooers, “Ciphering Structural Formulas: The Zatopleg System,” 
Zator Technical Bulletin No. 59, Zator Co., 79 Milk St., Boston 9, Mass. 

(11) C. N. Mooers, “Generation of Unique Ciphers for a Finite Network,” Zator 
Technical Bulletin No. 49, Zator Co., 79 Milk St., Boston 9, Mass. 

The second means by which the subset is reduced in 
size is by introducing a partial ordering among the nodes 
of the graph. The selection of the next node to be listed, 
where a choice is possible, can then in many cases be 
resolved on the basis of a preference implicit in this partial 
ordering. A simple illustration of such a partial ordering 
is shown in example V where preference is given to the 
nodes with the greater number of attachments at each 
point of choice. 

Example V. 

Structure      Possibilities for the order of citation of the nodes 

A-B-C-D        B   C 
               C   B 
               A   D 
               D   A 

In example V only nodes B and C were considered for 
node 1 because of the preference introduced by the partial 
ordering. Having selected one, the other is given prefer- 
ence for node 2 again because of the partial ordering. 
Having listed nodes 1 and 2, nodes 3 and 4 are fixed 
because of the rules for numbering. Thus, in this example 
the subset generated will consist of only two tables, 
whereas without the use of the partial ordering six tables 
would have been generated, and without both the partial 
ordering and the rules for numbering twenty-four tables 
would have been required. 

Although the partial ordering of the nodes based on the
number of attachments will usually greatly reduce the
number of tables in the subset, it is not sufficient to
adequately partition the set. The reason for this is that
in organic chemistry the number of bonds to any given 
atom rarely exceeds four or five. In order to increase the 
effectiveness of the partial ordering, a technique has been 
devised for computing a “connectivity value” for each 
node based on the invariant properties of the graph. These 
values are then used to introduce a partial ordering among 
the nodes in the same fashion as the number of connections 
were used in example V. 

The “connectivity values” are computed by first
assigning to each node an initial “connectivity value”
equal to the number of nonhydrogen atoms attached to
that node. This number is clearly an invariant property
of the graph. The computer then calculates the number (k)
of different “connectivity values” which have been
assigned. An iterative process is then established which
calculates a new “connectivity value” for each node. This
new value is the sum of the assigned values for the nodes
connected to the one under consideration. Having
computed a new value for each node based on the previous
values, the computer calculates the number (k') of
different values in the set of new values. If k' > k, the new
values are assigned to the corresponding nodes, k is set
equal to k', and the summation process is repeated. If,
however, k' ≤ k the process is terminated, and the last set
of values assigned to the nodes is used to induce a partial
ordering among the nodes. Using this partial ordering, the
size of the subset is reduced by giving preference to the
node associated with the higher “connectivity value” at
each point of choice in the numbering process described
earlier. It is important to note that the iterative process is
finite for any graph of X nodes where X is a finite number.
The process will terminate, under the conditions cited,
after no more than (X + 1) iterations since there are at
most X values that can be assumed by k which will cause
the process to continue. Examples 1 and 2 of Appendix I
illustrate the application of this technique for introducing
a partial ordering among the nodes of the graph.
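
The iteration just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the original IBM 1410 program: the adjacency-dictionary representation and the names connectivity_values, partial_ordering, and EXAMPLE_VI are assumptions introduced here, and the sample graph is the hydrogen-suppressed structure of example VI.

```python
from typing import Dict, Hashable, List

# Assumed representation: each node maps to its attached nonhydrogen nodes.
Graph = Dict[Hashable, List[Hashable]]


def connectivity_values(graph: Graph) -> Dict[Hashable, int]:
    """Return the last set of connectivity values assigned to the nodes."""
    # Initial value: the number of nonhydrogen atoms attached to each node.
    values = {node: len(neighbours) for node, neighbours in graph.items()}
    k = len(set(values.values()))
    while True:
        # New value: sum of the currently assigned values of the attached nodes.
        new_values = {
            node: sum(values[n] for n in neighbours)
            for node, neighbours in graph.items()
        }
        k_new = len(set(new_values.values()))
        if k_new <= k:
            # No gain in the number of distinct values: keep the last assigned set.
            return values
        values, k = new_values, k_new


def partial_ordering(values: Dict[Hashable, int]) -> List[List[Hashable]]:
    """Group nodes by connectivity value, highest value first."""
    groups: Dict[int, List[Hashable]] = {}
    for node, value in values.items():
        groups.setdefault(value, []).append(node)
    return [sorted(groups[v]) for v in sorted(groups, reverse=True)]


# Structure of example VI: A is attached to B, C, D, and E; E is attached to F.
EXAMPLE_VI: Graph = {
    "A": ["B", "C", "D", "E"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
    "E": ["A", "F"],
    "F": ["E"],
}

print(partial_ordering(connectivity_values(EXAMPLE_VI)))
```

Because k can never exceed the number of nodes and must increase on every pass that continues, the loop terminates for any finite graph, in line with the argument given in the text.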

After introducing the partial ordering among the nodes,
the generation of the subset of tables defined by this
ordering is begun. Even at this stage, however, the entire
subset is not always generated since in practice it is often
possible to eliminate large blocks of potential tables. To
describe the means by which potential tables are
eliminated during the generation process, it is necessary to
describe the means by which the unique table is ultimately
selected.

After generating any two of the tables of the subset, a 
preference between them is introduced by “alphabetizing” 
on the basis of the collating sequence of the machine 
symbols involved in the tables. The table which “sorts” 
to the top of the list is then selected as preferred over the 
other. If the two tables are identical, one is arbitrarily 
selected as preferred over the other. For purposes of this 
“alphabetization,” the tables are treated as a string of 
symbols in the following order (see example 3 of Appendix I):

A. The “FROM ATTACHMENT” list

B. The “RING CLOSURE” list

C. The “NODE VALUE” list

D. The “LINE VALUE” list

E. The “MODIFICATION” list

Since a preference or a lack of preference is introduced 
each time two tables are completed, it is never necessary 
to have more than two complete tables in memory at 
any given time. 
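
The selection rule can be illustrated with a short Python sketch. It is a sketch under stated assumptions: the Table type and the preferred_table function are hypothetical names, each list is held as a string, and Python's built-in string ordering stands in for the collating sequence of the machine. Only the current preferred table and the newly completed table are held at any one time.

```python
from typing import Iterable, NamedTuple, Optional


class Table(NamedTuple):
    """One candidate table; field order follows lists A through E in the text."""
    from_attachment: str
    ring_closure: str
    node_values: str
    line_values: str
    modifications: str


def preferred_table(tables: Iterable[Table]) -> Optional[Table]:
    """Keep the table that sorts first; at most two tables are held at once."""
    best: Optional[Table] = None
    for candidate in tables:
        # Tuples compare field by field, i.e. FROM list first, then RING
        # CLOSURE, NODE VALUE, LINE VALUE, and MODIFICATION.
        if best is None or candidate < best:
            best = candidate
        # Identical tables: the one already held is kept (an arbitrary choice).
    return best
```

Since tables generated for one structure have lists of equal length, comparing field by field gives the same result as comparing the concatenated strings.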

During the table generation process, when a complete 
table is in the computer memory and a second table is 
being generated, a determination is made after each step 
in the generation process whether the first, completed table 
is already preferable to the second, partially generated one. 
This determination is accomplished by comparing the two 
FROM lists up to the point of completion of the second 
and selecting, as preferred, the one which “sorts” first. 
If it is determined that the completed table is already 
preferred to the second, further generation of the second 
table is stopped, and all tables based on the fragment 
thus far generated are eliminated. 
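
The early-rejection test amounts to a prefix comparison of the two FROM lists. The following Python sketch shows the idea; the function name can_abandon and the representation of a FROM list as a sequence of integers are assumptions made for illustration only.

```python
from typing import Sequence


def can_abandon(complete_from: Sequence[int], partial_from: Sequence[int]) -> bool:
    """Return True when the completed table already sorts ahead of the partial one.

    Only the portion of the completed FROM list that corresponds to the
    entries generated so far for the second table is compared.
    """
    n = len(partial_from)
    return tuple(complete_from[:n]) < tuple(partial_from)
```

When the function returns True, every table that would be built on the partial fragment can be skipped, since each of them begins with the same entries and therefore sorts after the completed table.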

Another means of eliminating potential tables during 
the generation process is the provision of performing a 
simple look-ahead to determine a preference or lack of 
preference. 

In chemical structure drawings it is quite common to 
have two or more terminal atoms attached to the same 
atom. (A terminal atom is here defined as an atom at- 
tached to only one nonhydrogen atom.) The partial 
ordering of the nodes as described above does not resolve 
the order of selection of these terminal atoms. Thus, 
without provision for a simple look-ahead, the alternatives
would need to be generated and the tables compared to
determine a preference.

Example VI.

      B
      |
    C-A-E-F
      |
      D

Partial ordering of nodes
{A} > {E} > {B, C, D, F}

In example VI the partial ordering will cause nodes A
and E to be selected as the first and second nodes,
respectively. At this point, however, B, C, and D are equal
candidates for selection as the third, fourth, and fifth
nodes, thereby giving rise to the generation of six tables.
To prevent this common situation the computer detects
the condition and performs a look-ahead to determine the
effect of the possible choices on the next levels of the table.
This look-ahead can be done since the choice cannot affect
the FROM or RING CLOSURE lists. Therefore, the node
values are examined and any preference implied by them 
is introduced. If, however, the node values are equal, 
the determination of a preference falls next to the line 
values and finally to the node and line modifications. If 
the choices are equal at every level then it makes no 
difference which is selected next since the choices give rise 
to identical tables. By application of this simple look- 
ahead the program is able to eliminate the generation of 
the possible alternatives and the selection from among 
them. In example VI, for instance, only one table will 
be generated instead of the six which would have been 
generated without the look-ahead technique. 
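
A minimal Python sketch of this look-ahead follows. It assumes that each terminal atom can be summarized by its node value, line value, and modification, that Python's string ordering stands in for the machine collating sequence, and that the names TerminalAtom and order_terminal_atoms are purely illustrative.

```python
from typing import List, NamedTuple


class TerminalAtom(NamedTuple):
    label: str          # identifier used only for this illustration
    node_value: str     # atomic symbol
    line_value: str     # bond to the common parent atom
    modification: str   # node and line modifications, if any


def order_terminal_atoms(candidates: List[TerminalAtom]) -> List[TerminalAtom]:
    """Fix the order of terminal atoms sharing a parent without generating tables.

    Preference falls first to the node value, next to the line value, and
    finally to the modification. Candidates that tie at every level give rise
    to identical tables, so their relative order is left as supplied.
    """
    return sorted(
        candidates,
        key=lambda atom: (atom.node_value, atom.line_value, atom.modification),
    )
```

For three terminal atoms this replaces the generation and comparison of six tables with a single sort.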

At present the look-ahead technique is used only to deal 
with the case of terminal atoms. The technique could 
be extended, however, without affecting the ultimate 
choice of the unique description. Determination of 
whether this extension is economically required will be 
made on the basis of operating experience in the coming 
months, but at this point it seems unlikely to be necessary. 

The last technique for reducing the number of tables 
generated is the provision to recall, under certain condi- 
tions, preferences detected during the generation process. 
Because of the nature of the techniques thus far described, 
the size of the subset is a product function based on the 
number of choices arising; that is, the same preference 
or lack of preference is rediscovered several times. 

Computed connectivity values (structure of example VI)

Iteration   A   B   C   D   E   F   k
    0       4   1   1   1   2   1   3    (these values used for partial ordering)
    1       5   4   4   4   5   2   3

Example VII.

    E-D-A-B-C
        |
    J-I-F-G-H
        |
        K

For instance, in example VII the preference or lack of
preference between B and D will be determined twice,
once when G and I are listed third and fourth, respectively, 
and once when G and I are listed fourth and third, 
respectively. It would be more efficient to remember 
the preference once detected and to use this information 
should the same choice arise again. The problem is that 



112 



H. L. Morgan 



the preference can be recalled only when it is independent 
of any previous choices; therefore, the preference can be 
remembered only under certain conditions. These condi- 
tions are: first, the atom, from which the choice arises, 
atom A in example VII, must be bonded to exactly three 
other nonhydrogen atoms, two of which are involved 
in the choice; and, second, the bond not involved in the 
choice, bond A-F in example VII, must not be part of a 
cycle. 

If during the generation process a point of choice is 
reached which meets the conditions cited, the computer
program divides the graph into two subgraphs by re- 
moving the bond which is not involved in the choice. In 
example VII the bond between A and F is temporarily 
removed. By virtue of the fact that the removed bond is 
not a member of a cycle (specified condition), the result 
of this removal is to divide the graph into two subgraphs. 
The program then operates on the subgraph involved in 
the choice so as to determine a preference or lack of prefer- 
ence between the two choices. This preference is de- 
termined by generating the set of tables which arises 
from the choices in the subgraph. Once such a preference 
is determined the graph is restored, and the preference 
is recalled when the same choice arises again. If there 
is no preference between the two choices, then it makes 
no difference which is selected since they are indistinguish- 
able. In this case, the preference will be made arbitrarily 
and reused should the opportunity arise again. 
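
The two conditions under which a preference may be recalled can be checked as in the Python sketch below. The function names and the adjacency-dictionary representation are assumptions, and the sample graph is a reading of the structure in example VII (A attached to B, D, and F; F attached to A, G, I, and K), not data taken from the original program.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]


def bond_is_in_cycle(graph: Graph, a: str, b: str) -> bool:
    """True if b can still be reached from a after the bond a-b is removed."""
    seen = {a}
    stack = [a]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if node == a and neighbour == b:
                continue  # this is the temporarily removed bond
            if neighbour == b:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def preference_can_be_recalled(graph: Graph, atom: str, choice: Tuple[str, str]) -> bool:
    """Check the two conditions stated in the text for remembering a preference."""
    neighbours = set(graph[atom])
    # First condition: exactly three nonhydrogen neighbours, two of them in the choice.
    if len(neighbours) != 3 or len(set(choice)) != 2 or not set(choice) <= neighbours:
        return False
    (other,) = neighbours - set(choice)
    # Second condition: the bond not involved in the choice is not part of a cycle.
    return not bond_is_in_cycle(graph, atom, other)


# Assumed reading of the structure in example VII.
EXAMPLE_VII: Graph = {
    "A": ["B", "D", "F"], "B": ["A", "C"], "C": ["B"],
    "D": ["A", "E"], "E": ["D"],
    "F": ["A", "G", "I", "K"], "G": ["F", "H"], "H": ["G"],
    "I": ["F", "J"], "J": ["I"], "K": ["F"],
}

print(preference_can_be_recalled(EXAMPLE_VII, "A", ("B", "D")))
```

Removing a bond that is not part of a cycle always separates the graph into two subgraphs, which is why the second condition guarantees that the choice can be evaluated independently of the rest of the structure.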

Of the several methods employed to reduce the number 
of tables generated the two most significant are (1) the 
partial ordering of the nodes by the computed “connec- 
tivity values” and (2) the rules for numbering the nodes 
for table generation. Together these two methods comple- 
mented by the other techniques reduce what would other- 
wise be a devastatingly time consuming task to one which 
requires only a trivial amount of time. 

In order to demonstrate the presumed advantages of 
the techniques described, they were programmed for an 
IBM 1410 Data Processing System. Over 25,000 chemical 
structures from CAS files, selected solely on the basis of 
immediate availability, were processed. The description 
of these structures and the statistics resulting from this 
test are shown in Appendix II. Based on these statistics 
and the published timings of other techniques which have 
been described in the literature, it appears that the 
present technique offers significant economic advantage
over other methods for accomplishing the same end. 



In this example the iterative process will be terminated 
after two iterations and the values assigned after the first
will be used to introduce the partial ordering. In this 
example the subset of tables will consist of exactly four 
tables. 



Example II.

         Connectivity values                             Value of k   Subset, tables

i = 0    2, 3                                                 2          14,592
i = 1    4, 5, 6, 7, 8, 9                                     6             160
i = 2    8, 9, 11, 12, 14, 17, 18, 19, 20, 22, 24, 27        12              32
i = 3    18, 19, 20, 21, 26, 27, 28, 29, 31, 32,             19               8
         38, 41, 42, 46, 50, 58, 68, 69, 75
i = 4    will also yield a value for k of 19; thus the
         process terminates and the values at i = 3 will
         be used to introduce the partial ordering of
         the nodes.



Example III.

[Structure diagram not reproduced.]

Using the connectivity values shown in example I of
this appendix a partial ordering among the nodes is
introduced.

Id > \h\ > 16, d\ > \k, e, i, g) > I;, /) 

Using this partial ordering the possible tables are gen- 
erated and compared giving preference for lower num- 
bering to the node or nodes which are between the left- 
most pair of braces at each point of choice. For this 
example four tables must be generated and the unique 
table selected from among the set. 

The preferred numbering and the corresponding unique 
table are shown below:




APPENDIX I

Example I.

i = 0    connectivity values 1, 2, 3              k = 3
i = 1    connectivity values 3, 4, 5, 6, 7, 9     k = 6
i = 2    connectivity values 6, 10, 11, 17, 19    k = 5


Node     Node value    From attachment    Line value

  1          C               -                -
  2          C               1                1
  3          C               1                1
  4          C               1                1
  5          C               2                1
  6          C               2                1
  7          C               3                1
  8          O               3                1
  9          C               4                1
 10          S               4                1
 11          C               5                1
 12          C               6                1

Rings        7              11                1
             9              12                1

In computer storage the unique table appears as follows:

From list       001001001002002003003004004005006
Ring closure    007011009012
Node values     CCCCCCCOCSCC
Line values     1111111111111



APPENDIX II

In order to test the presumed economic advantages of
the technique described in this paper, over 25,000 chemical
structures were selected from the CAS files. These
structures were selected solely on the basis of immediate
availability and consisted of the following:

A   The “Ring Index” structures including the first supplement     9,568
B   A CAS File of commercial compounds                             7,154
C   The structures from Lange’s “Handbook”                         4,596
D   The CAS File of compounds containing only carbon,
    hydrogen and sulfur                                            4,287

    Total                                                         25,605

The following is a table of statistics resulting from the
testing of these techniques using the file described above:

A   Sample size                                          25,605 structures
B   Total 1401 computer time for the generation
    of the unique description                            4.93 hr.
C   Average number of compounds per minute for
    the generation of the unique description             92.8/min.
D   Average cost per compound for the generation
    of the unique description                            2.2 cents
E   Average number of tables generated per compound      4.3
3. Modular Programming

The redesigned system will be written as a
set of program modules, each designed to perform
a specific system function. This offers several
advantages:

a) It increases the adaptability and flexibility
of the system because modules can be changed
without affecting the rest of the system.

b) It will be easier to modify programs, since
systems personnel will work with smaller
program “pieces”.

c) Modular programs make it easier to utilize
subroutines from the library.

d) Modular programming allows for future
expansion -- for example, a module for interfacing
the Registry with the Substructure Search System.



Technical Improvements
1. Tautomers



The new system provides computer programmed
identification of unique compounds that can be
represented by two different, but equally valid,
structural diagrams.

A generalized representation is:

M=Q-(Q=Q)n-ZH  ⇌  H-M-(Q=Q)n-Q=Z

where M, Q, and Z are combinations of C, N, P, As,
O, S, Se, and Te (including abnormal mass analogs);
n ≥ 0 (integral); = is a double bond; - is a single
bond; and H must be present as shown.

The following are examples of the types of
structures involved:
a) General representation

M=Q-(Q=Q)n-Z-H
M = Z = N (trivalent)
Q = C

n is limited to zero in the initial system;
the value can be extended in subsequent versions
(1, 2, ...) if justified. Analogous structures
with P or As may be found to exhibit tautomerism;
provision is made for such an extension if
necessary.

[Ring-structure example (methyl-substituted ring containing N and NH) not recoverable.]

Me-C(=NH)-NHEt  ⇌  Me-C(NH2)=NEt

b) General expression

M=Q-(Q=Q)n-Z-H
M = Q = Z = N

As in a, n is initially limited to zero and
provision is made for future substitution
of P or As for N.

PhNH-N=N-[Br-substituted ring]  ⇌  PhN=N-NH-[Br-substituted ring]
c) General expression

M=Q-(Q=Q)n-Z-H
M = Z = O, S, Se, Te (and abnormal mass analogs)
Q = C
n = 0

Me-C(=O)-SH     ⇌  Me-C(OH)=S

Ph-C(=17O)-OH   ⇌  Ph-C(OH)=17O

Et-C(=S)-SeH    ⇌  Et-C(SH)=Se

d) General expression

M=Q-(Q=Q)n-Z-H
M = Z = O, S, Se, Te (and abnormal mass analogs)
Q = N, P, As, Sb (tri- or pentavalent),
    S, Se, Te (tetra- or hexavalent), Cl, Br, I
n = 0
2. Improved Handling of Ring Bonds

The new system incorporates improved procedures
for handling cyclic systems. The paths to identify
rings are traced starting with the ring closure pairs,
thus eliminating the duplicate tracing that resulted
from a trial-and-error basis. This technique provides
improved efficiency as to type -- ring, chain, or
alternating single-double.

3. Automatic Editing of Text Descriptors

The 360 system includes editing routines for
text (stereochemical) descriptors which simplify
structure problem resolution by eliminating part
of the human involvement in the process. The amount
of chemist time for such resolution will be reduced
by about 35%.

A table of standard, valid text descriptors
including synonyms is used in a computer edit of this
feature of the connection table. The program checks
for the presence (exact match) of the input text
descriptor in this table; rejects the table if the input
text descriptor is not valid (unless an override is
coded); corrects certain descriptors (for the purpose
of chemical checking because of ambiguity, marked ***
in the table); and performs automatic resolution in
ties involving valid unambiguous descriptors.

Text descriptors have been developed for the
main body of steroids, terpenes, alkaloids, and
carbohydrates based upon an alphabetic term or
base name, representing a basic parent structure
with implied stereochemistry at specified positions.
This base name is the name of the parent structure
or a term closely related to it.

For steroids, terpenes, and alkaloids,
substituents attached to the parent at a potentially
asymmetric atom require a locant or locants
preceding the base name. This consists of a numerical
This paper describes a computer-based input system which reduces or eliminates
many repetitive operations. This system reduces and conserves the over-all human
effort required for input of structural, nomenclature, and bibliographic data while
simultaneously improving the efficiency of the registration operation and increasing
the reliability of the stored data.

Since early 1965, Chemical Abstracts Service has been
developing an experimental computer-based Chemical
Compound Registry System which is being supported by
the National Science Foundation (NSF), the Department
of Defense, the Food and Drug Administration, the
National Institutes of Health, and the National Library
of Medicine through a contract with NSF.

* Presented before the Division of Chemical Literature, 152nd National Meeting
of the American Chemical Society, New York, N. Y., Sept. 15, 1966.

** Present address: IBM Corp., 1000 Westchester Ave., Harrison, N. Y.
The data processed for this computer-based system
involve structures, names, and references for compounds,
primarily those processed in the current volumes of
Chemical Abstracts (CA). Thus, each six months, the
Chemical Compound Registry System receives data on
about 200,000 compounds which are processed at CAS
for Chemical Abstracts, and the Ring Index and its
Supplements. Of these 200,000 compounds, about 163,000 have
been reported previously in the literature, and have thus
been processed previously at CAS. The remaining 37,000
compounds are newly reported chemical entities.

Journal of Chemical Documentation

Computer-Based Subject Index Support System at CAS

Without the computer-based support to be described
in this paper, the registration of previously reported
compounds for which new data continue to be documented
in the literature would involve many highly repetitive
operations: drawing of structures, calculation of molecular
formulas, identification of ring systems, the derivation
of systematic nomenclature, keyboarding, and editing, to
name a few.

This input system is an integral part of the experimental
CAS Chemical Compound Registry System which has been
the topic of several recent papers (1), and will not be
discussed in detail here. However, a brief review of the
component parts of the Registry System is in order to
clarify its relationship to the computer-based input system.
The principal components of the Chemical Compound
Registry System are three major computer files of
compound-oriented data. The connecting link for these
three files is the permanent computer address for each
compound, the Registry Number. The three files are:

1. The Structure File, which contains unique descriptions of
the structural formulas including stereochemistry and
isotopic labeling. The input to this file is subjected to an
elaborate computer editing program, described by Leiter
and Morgan (2).

2. The Nomenclature File, which contains all of the names
for a given compound available in a variety of sources.
These names are coded as to type, e.g., preferred CA index
name and trade name. Presently, editing for the
Nomenclature File is done mainly by chemists during the production
of the CA indexes. The existence of the Nomenclature File
makes possible the printing of a variety of special indexes,
e.g., an index of trade names versus the CA preferred index
names.

3. The Bibliography File, which contains references to the CA
abstracts. By use of other existing computer files these
references can be coordinated with the corresponding original
journal references.

These files of the experimental CAS Compound Registry
System are reviewed for accuracy through the efforts of
CAS chemists and clerical personnel, aided by the
computer-based input system described in the following
pages. Two types of computer support are provided
through this system, one based on the Structure File,
and one based on the Nomenclature File.



SUPPORT THROUGH THE STRUCTURE FILE 

The first set of procedures, based on the Registry System
Structure File, is applied to compounds registered from
those sections of CA reporting the highest percentage
of new compounds, namely, the synthetic organic
sections.

Compounds selected for registration from these sections
have their structures drawn and molecular formulas
calculated by CAS professional staff. The structural formulas
are then clerically processed for registration. In this
operation, all nonhydrogen atoms in the structural formulas
are numbered in any convenient sequence. Then the atoms
and bonds are keyboarded in tabular form for input to
the computer. In the same operation the CA reference,
any trivial names, and the calculated molecular formula
are keyboarded for computer input. The structure
is then registered, a process in which a computer program
compares the structural information input with the data
on file for all other structures. Registration results in
one of two situations.

The first, a “hit,” means that the compound has been
registered previously and thus is already on file. In such
cases, the computer program retrieves from the Registry
Files the molecular formula, the edited and correctly
formatted index name(s), and the Registry Number for the
compound. This information, together with the originally
keyboarded CA reference for the current volume, is printed
by the computer on a data sheet (Figure 1) and sent
to a CA chemist for review. All of the information on
this sheet is reviewed for accuracy and corrected for
discrepancy as necessary.






40764 DATA CHANGED SATURDAY, JUNE 18, 1966 

TO INDEXING 

Vol 64 Sec 42- 7 Start 9783e 1 End 9791a 0 Ind XXX Typ DPR Dat-66167 
64;9789f2-3 
F MF * 

R PINH * Androsta-3,5-dien-17-one

C ID 64;9789f2-3 

T/R 1912636 



Figure 1. Structure file "hit" computer proof sheet. 
Through the information retrieved from the Registry
File, the input system has reduced the effort required
for registration. That is, compounds for which hits result
do not require naming, and the names do not require
keyboarding, editing, or re-entry into the computer.

Additionally, if the compound contains a ring system,
the computer programs will determine whether the ring
system is new or has been registered previously. (Records
for some 18,000 ring systems are on file as a result of
the registration of the Ring Index and its Supplements.)

Where there is a hit, the computer prints a data sheet
(Figure 2) giving the Registry Number, Ring Index
Number, and molecular formula of the ring system. This saves
the effort of renaming and reregistering the ring system.

The second situation that may obtain as a result of
registration is “no hit,” meaning that the compound is
new to the Registry System. In such cases, the computer
prints a data sheet (Figure 3) containing the molecular
formula and the CA reference for the current volume
(that is, the data that were input to the computer for
registration). These data are proofread to assure that
they have entered the computer accurately, and then
CAS professional staff derive the CA index name(s), which
are keyboarded and entered into the Registry Files.
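
The “hit” and “no hit” outcomes of structure registration can be pictured with the following Python sketch. It is a simplified illustration only: the Record and Registry classes, the use of an in-memory dictionary in place of the Structure File, and the sequential assignment of Registry Numbers are all assumptions and do not describe the actual CAS programs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    registry_number: int
    molecular_formula: str
    index_names: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)


class Registry:
    """Toy stand-in for the Structure File, keyed by the unique structure description."""

    def __init__(self, first_number: int = 1) -> None:
        self._by_structure: Dict[str, Record] = {}
        self._next_number = first_number

    def register(self, unique_description: str, molecular_formula: str,
                 reference: str) -> Tuple[str, Record]:
        record = self._by_structure.get(unique_description)
        if record is not None:
            # "Hit": the stored formula, names, and Registry Number are
            # retrieved; only the new CA reference is added.
            record.references.append(reference)
            return "hit", record
        # "No hit": the compound is new, so a record is created; index names
        # are derived by staff and entered later.
        record = Record(self._next_number, molecular_formula, references=[reference])
        self._next_number += 1
        self._by_structure[unique_description] = record
        return "no hit", record
```

In this picture the saving comes from the “hit” branch, where nothing but the reference has to be supplied for a compound already on file.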



by the computer together with the name and CA reference 
dictated at the time of the comj)ound selection. Figure 
4 illustrates such a computer-printed i)roof sheet. 

Through this proceduie, the computer input system 
has again reduced the effort required for registration. For 
compounds for which hits re.sult, the structural formulas 
do not have to be drawn, the molecular formulas do 
not have to be calculated, and the preferred CA index 
names do not have to be generated, keyboarded, edited, 
or re-entered into the comi)uter. 

The second situation that may obt.ain after name input 
is “no hit,” meaning that the name is not on file. For 
such situations, structural formulas must be drawn and 
the molecular formulas calculated. The registration proce.ss 
then proceeds, with further support obtained as previously 
described under Support Through the Structure File. 

Even in the “no hit” situation, however, the name 
and associated data have been added to the computer 
record. The next time the same name is encountered 
in CA indexing, computer support through the Name 
File will be possible. 



SOME SAVINGS OF CLERICAL EFFORT 



SUPPORT THROUGH THE NOMENCLATURE FILE 

The set of input procedures Irased on the Registry 
System Name File applies to sections of CA other than 
the synthetic organic sections. These .sections contain a 
high percentage of compounds reported previously in the 
literature. 

CAS staff concerned with the selection of compounds 
to be registered dictate the avail.able systematic or trivial 
names of the selected compounds and the corresponding 
CA references. This information is then keyboarded and 
input to the computer, where each name is compared 
with the names already filed in the Nomenclature File. 
As with Structure File support, one of two situations 
obtains when a systematic or trivial name is compared 
with names on file in the computer. 

The first, a “hit,” means that the input name is already 
on file and therefore that the compound has been registered 
previously. In these cases, the computer program retrieves 
from the files the molecular formula, the edited and cor- 
rectly formatted CA index name(s), and the Registry 
Number for the compound. This information is printed 



Two areas where the computer-based support, system 
has efiected considerable savings in the use of clerical 
effort involve the single keyboarding of preferred index 
names and CA references, and the use of keyboarding 
“shortcuts” that save keystrokes during data input. 

CA index operations currently u.se as manuscript 3 x 
5 cards containing one entry per card. For compound 
entries, the name of the compound and the CA reference 
appear on two cards, one for the Subject Index, and 
one for the Formula Index. Before the computer-based 
input system was developed, these 3x5 cards were typed 
directly. In addition the registration operation then 
required another keyboarding of many compound names 
and CA references. These operations are now all combined 
into a single keyboarding of data for entry into the com- 
puter, which then delivers the index cards and records 
the appropriate information in the Registry. The total 
keyboarding effort has been reduced by an estimated 25% 
as a result of this procedure. 

The use of keyboarding shortcuts has also been made 
possible through the use of the computer input system. 
Under the old system, the typists were never afforded 
the possibility of shortcuts. For example, the registration 




Figure 2. Computer-produced ring query sheet. 
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Figure 3. Structure file "no hit" computer proof sheet. 



of six different esters of 2,4,6-triiodobenzoic acid required 
the keyboarding of the parent acid name six times. 

In the computer-based input system, however, the typists 
now keyboard the complete name of the parent acid 
once. For the five remaining esters, a two-character “ditto 
code” is typed instead of the parent acid name (Table 
I). The ditto code instructs the computer to print out 
the parent acid name keyboarded for the first entry. Note 
that the ditto code is fully expanded by the computer 
and thus is not carried in the permanent files. Therefore, 
it makes no difference in the stored data whether or 
not the typist uses the shortcut. 

Two other ditto features are also used. One is used 
to repeat a name’s modification (that portion of the name 
that appears in light-face type in the CA Subject Index) 
from one entry to another. The second, used in registering 
alphabetized lists of names, repeats that portion of a 
name up to and including the first comma followed by 
a space, e.g., the comma of inversion. We estimate that 
42,000 keystrokes are saved by these three features every 



work day; this is equivalent to 5% of the total keyboarding 
effort. 
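
The ditto-code expansion can be sketched in a few lines. This is an illustration only, not the CAS program: the marker `//`, the function name `expand_ditto`, and the rule that the repeated parent portion runs through the last comma-plus-space of the most recent full name are all assumptions made for the example.

```python
DITTO = "//"  # hypothetical two-character ditto code (the real code is not legible in the source)


def expand_ditto(entries):
    """Return the entries with every ditto code replaced by the parent name."""
    expanded = []
    parent = None
    for entry in entries:
        if entry.startswith(DITTO):
            if parent is None:
                raise ValueError("ditto code used before any full name")
            # Re-attach the remembered parent portion to the typed remainder.
            expanded.append(parent + entry[len(DITTO):])
        else:
            # Assumption: the repeatable parent portion ends at the last ", ".
            head, sep, _ = entry.rpartition(", ")
            parent = head + sep
            expanded.append(entry)
    return expanded


typed = [
    "Benzoic acid, 2,4,6-triiodo-, ethyl ester",
    "//methyl ester",
    "//3-nitropropyl ester",
]
for name in expand_ditto(typed):
    print(name)
```

Because the expansion happens at input time, the stored names are the same whether or not the typist used the shortcut, which matches the behavior described above.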



PROJECTED SUPPORT IN PRINTED ISSUE, VOLUME, 
AND COLLECTIVE INDEXES 

In 1967, CAS expects to install a much more advanced 
computer support system, which will be directed primarily 
at support of index operations rather than Registry operations. 
Through this system, CAS will eliminate the use 
of the 3 x 5 manuscript cards, and, instead, produce 
“camera-ready” copy from the computer for final proof 
and printing operations. This will result in greater computer 
support for the indexing operations. For example: 

1. Alphabetizing of the entries for an index will be performed 
as a computer operation. 

2. The merging of edited index volumes to collective indexes 
will also be performed as a computer operation. 
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Figure 4. Name file "hit" computer proof sheet. 
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D. J. VVlllTTINCillAAl, F. R. Wetsel, and M. L. Morcjan 
Table I. Example Use of "Ditto Code" to Save Keystrokes 



Typed as                                     Computer printed as 

Benzoic acid, 2,4,6-triiodo-, ethyl ester    Benzoic acid, 2,4,6-triiodo-, ethyl ester 
Ipmethyl ester                               Benzoic acid, 2,4,6-triiodo-, methyl ester 
Ippropyl ester                               Benzoic acid, 2,4,6-triiodo-, propyl ester 
Ipisopropyl ester                            Benzoic acid, 2,4,6-triiodo-, isopropyl ester 
Ipbutyl ester                                Benzoic acid, 2,4,6-triiodo-, butyl ester 
Ip3-nitropropyl ester                        Benzoic acid, 2,4,6-triiodo-, 3-nitropropyl ester 



Further, new indexes and a data base for searching 
property, reaction, and use information from the computer-based 
Subject Index Support System will become a reality. 

Our users will receive the first products of this com- 
pletely mechanized system in 1967 in the form of the 
computer-composed volume Author and Formula Indexes. 
Note, however, that the Registry Support System 
described in this paper is in operation at CAS at this 
time. 



APPENDIX 

Explanation of Terms Used on Computer-Produced 
Data Sheets 

The data sheets produced by the computer as described 
in the foregoing text carry several types of information 
for use in entering compounds into the CAS Chemical 
Compound Registry System. The sheets are printed in 
worksets grouped by CA sections and by column fractions, 
and within a given workset, each compound is represented 
by a single data sheet. The following is an explanation 
of the terms used on the data sheets in Figures 2, 3, 
and 4 of this paper. 

The heading of each sheet includes a sequential number 
assigned by the computer, an identification of the type 
of sheet (e.g., “new worksheet”), and the day and date 
on which the data were processed by computer. 

The following data are keyboarded once for each workset 
and are then automatically printed by the computer on 
each applicable data sheet. 

Vol        The volume of CA from which the data were obtained. 

Sec        The section and issue number of CA from which the data 
           were obtained. 

Start-End  The limits (expressed as column numbers, letter fractions, 
           and abstract numbers of the CA issue) between which the 
           data were obtained. 

Ind        The chemist, identified by initials, who dictated the data. 
           (The examples use XXX as the indexer’s initials.) 

Typ        The typist, identified by initials, who keyboarded the data. 

Dat        The date the data were keyboarded. 

The following are codes for major fields keyboarded 
for each data set: 

F Formula. 

R Preferred CA index name. 

N Added CA index name. 

Ki .r, Extra added CA index name. 

Eg !» Fields in which systematic, trade, or trivial names 
are input for Index Support through the Nomen- 
clature File. 

C Identification or reference. 

The following are codes for subfields printed 
automatically by the computer: 

MF         Molecular formula. 

PINH       Preferred index name heading, that portion of the index 
           entry that appears in bold-face type in the subject indexes. 

EAINH      Extra added index name heading. 

ID         The CA reference including column number, letter fraction, 
           abstract number, and compound number. The latter two 
           numbers are for internal computer use only. 

T/R        The temporary identification number (T) or the Registry 
           Number (R) of the compound. Temporary identification 
           numbers are used in initial processing until a Registry 
           Number is assigned. The Registry Number becomes a 
           compound’s permanent identification in the CAS Chemical 
           Compound Registry System. 



LITERATURE CITED 

(1) See, for example, Leiter, D. P., Jr., Morgan, H. L., Stobaugh, 
R. E., J. Chem. Doc. 5, 238 (1965). 

(2) Leiter, D. P., Jr., Morgan, H. L., “Quality Control and 
Auditing Procedures in the Chemical Abstracts Service Compound 
Registry,” ibid., 6, 226 (1966). 

(3) These clerical operations are described more fully in Leiter 
et al., ibid., 5, 240-1 (1965). 
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APPENDIX F 



Improvements in Structure Registry Effected 



with the Redesign and Reprogramming 
for IBM 360 Computers 
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Extracted from "System and Program. Documentation 
for the Chemical Abstracts Service Registry System" 
Copyright 1968 by the American Chemical Society 
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I. 



Systems Improvements 

1. Standard File Formats 

Standard formats have been established for 
each item of data in each file. This provides 
several advantages: 

a) Each data element is well-defined and has 
   consistent representation throughout the 
   system. 

b) Standard formats provide interface continuity 
   between files: a given data element has the 
   same format in all files. 

c) Because data elements have standard formats, 
   standard input and output programs need be 
   written only once, then reused as needed. 
   These standard input/output routines are 
   included in the CAS standard subroutine library. 

d) Eleven of the 12 data fields on the Registry 
   Structure Master File are variable in length. 
   Therefore, the system will adapt to future 
   changes in data length without requiring 
   program changes. 

e) Standard formats simplify searching and can 
   therefore improve search retrieval. 
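
One common way to make fields variable in length, so that longer data can be stored later without program changes, is to prefix each field with its length. The sketch below illustrates that general idea only. The two-byte length prefix, the sample field contents, and the function names are assumptions; the actual layout of the Registry Structure Master File is not given in this document.

```python
import struct


def pack_record(fields):
    """Serialize byte-string fields, each preceded by a 2-byte big-endian length."""
    out = bytearray()
    for field in fields:
        out += struct.pack(">H", len(field))
        out += field
    return bytes(out)


def unpack_record(data):
    """Recover the list of fields written by pack_record."""
    fields = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        fields.append(data[pos:pos + length])
        pos += length
    return fields


# Hypothetical record: a molecular formula, an empty field, and a name.
record = [b"C7H3I3O2", b"", b"Benzoic acid, 2,4,6-triiodo-"]
assert unpack_record(pack_record(record)) == record
```

A reader written this way never needs to know the maximum field length in advance, which is the property item d) relies on.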



2. Improved 360 Hardware/Software Capabilities 

The reprogrammed systems will take advantage 
of improved equipment and software capabilities 
offered by the 360 computers. 

a) Processing is speedier because of faster core 
   access times. 

b) The ability to process data one-half byte at 
   a time permits smaller units of data to be 
   manipulated and compacts the file further. 
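
Half-byte processing can be illustrated by packing two small values into each byte. The sketch below is a generic example of the idea; the function names and the choice of 4-bit values are assumptions and say nothing about the actual Registry file layout.

```python
def pack_nibbles(values):
    """Pack integers in the range 0-15 two to a byte, high half first."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in four bits")
    # Pad with a zero half-byte when the count is odd.
    padded = list(values) + [0] * (len(values) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(data, count):
    """Recover the first `count` four-bit values from packed bytes."""
    out = []
    for byte in data:
        out.append(byte >> 4)
        out.append(byte & 0x0F)
    return out[:count]


packed = pack_nibbles([1, 2, 3])
assert packed == bytes([0x12, 0x30])
assert unpack_nibbles(packed, 3) == [1, 2, 3]
```

Three values occupy two bytes here instead of three, which is the kind of compaction item b) refers to.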
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3. Modular PrograrmTilng 

The redesigned system will be written as a 
set of program modules, each designed to perform 
a specific system function. This offers several 
advantages: 

a) It increases the adaptability and flexibility 
   of the system because modules can be changed 
   without affecting the rest of the system. 

b) It will be easier to modify programs, since 
   systems personnel will work with smaller 
   program "pieces". 

c) Modular programs make it easier to utilize 
   subroutines from the library. 

d) Modular programming allows for future expansion, 
   for example, a module for interfacing 
   the Registry with the Substructure Search System. 

II. Technical Improvements 

1. Tautomers 

The new system provides computer-programmed 
identification of unique compounds that can be 
represented by two different, but equally valid, 
structural diagrams. 

A generalized representation is: 

M=Q-(Q=Q)n-Z-H  <=>  H-M-(Q=Q)n-Q=Z 

where M, Q, and Z are combinations of C, N, P, As, 
O, S, Se, and Te (including abnormal mass analogs); 
n >= 0 (integral); = is a double bond; - is a single 
bond; and H must be present as shown. 

The following are examples of the types of 
structures involved: 
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a) General representation 

M=Q-(Q=Q)n-Z-H 

M = Z = N (trivalent) 
Q = C 

n is limited to zero in the initial system; 
the value can be extended in subsequent versions 
(1, 2, ...) if justified. Analogous structures 
with P or As may be found to exhibit tautomerism; 
provision is made for such an extension if 
necessary. 



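[Structural diagrams not reproduced: a methyl-substituted nitrogen 
heterocycle shown in its two N-H tautomeric forms, and the amidine 
pair Me-C(=NH)-NHEt <=> Me-C(NH2)=NEt.] 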



b) General expression 

M=Q-(Q=Q)n-Z-H 

M = Q = Z = N 

As in a, n is initially limited to zero and 
provision is made for future substitution 
of P or As for N. 





[Structural diagram not reproduced: a triazene pair of the form 
PhNH-N=N- <=> PhN=N-NH-; the terminal substituent is not legible.] 


















CHEMICAL ABSTRACTS SERVICE 
System Development 

DOCUMENTATION MANUAL 

NAME   System: 360 Registry 
TOPIC  System Improvements 
ID     System: A015 



c) General expression 

M=Q-(Q=Q)n-Z-H 

M = Z = O, S, Se, Te (and abnormal mass analogs) 
Q = C 
n = 0 

Examples (structural diagrams rendered linearly): 

Me-C(=O)-SH   <=>  Me-C(OH)=S 
Ph-C(=O)-OH   <=>  Ph-C(OH)=O    (one oxygen labeled 17) 
Et-C(=S)-SeH  <=>  Et-C(SH)=Se 
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d ) 



General expression 

M=Q-(Q=Q)n-Z-H 

M = Z = O, S, Se, Te (and abnormal mass analogs) 
Q = N, P, As, Sb (tri- or pentavalent), 
    S, Se, Te (tetra- or hexavalent), Cl, Br, I 
n = 0 

[Structural diagram not reproduced: a phenylarsonic acid, 
Ph-As(=O)(OH)-OH, with one oxygen labeled 18.] 
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Me~P=S 



OH 



OK 



18 
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2 . Improved Handling of Ring Bonds 

The new system incorporates improved procedures 
for handling cyclic systems. The paths to identify 
rings are traced starting with the ring closure pairs, 
thus eliminating the duplicate tracing that resulted 
from a trial-and-error basis. This technique also 
improves the efficiency with which bonds are classified 
as to type: ring, chain, or alternating single-double. 
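
The idea of tracing rings from ring closure pairs can be sketched as follows. The example builds a spanning tree over the atoms, treats every bond left out of the tree as a ring closure pair, and traces one ring per pair by walking both atoms back to their common ancestor. The data layout and the function name `rings_from_closures` are assumptions for illustration; this is not the CAS algorithm, and it returns one ring per closure pair rather than any particular canonical ring set.

```python
from collections import deque


def rings_from_closures(n_atoms, bonds):
    """Return one ring (list of atom indices) per ring closure bond."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    # Breadth-first spanning tree; bonds not in the tree are ring closures.
    parent = {}
    tree_bonds = set()
    for root in range(n_atoms):
        if root in parent:
            continue
        parent[root] = None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in parent:
                    parent[v] = u
                    tree_bonds.add(frozenset((u, v)))
                    queue.append(v)

    rings = []
    for a, b in bonds:
        if frozenset((a, b)) in tree_bonds:
            continue
        # Path from a up to the root.
        path_a = []
        node = a
        while node is not None:
            path_a.append(node)
            node = parent[node]
        position = {atom: i for i, atom in enumerate(path_a)}
        # Climb from b until the path from a is met (the common ancestor).
        path_b = []
        node = b
        while node not in position:
            path_b.append(node)
            node = parent[node]
        rings.append(path_a[:position[node] + 1] + path_b[::-1])
    return rings


cyclohexane = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(rings_from_closures(6, cyclohexane))
```

Each ring is found exactly once, directly from its closure pair, so no path is traced and then discarded as in a trial-and-error search.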

3. Automatic Editing of Text Descriptors 

The 360 system includes editing routines for 
text (stereochemical) descriptors which simplify 
structure problem resolution by eliminating part 
of the human involvement in the process. The amount 
of chemist time for such resolution will be reduced 
by about 35%. 

A table of standard, valid text descriptors, 
including synonyms, is used in a computer edit of this 
feature of the connection table. The program checks 
for the presence (exact match) of the input text 
descriptor in this table; rejects the connection table 
if the input text descriptor is not valid (unless an 
override is coded); corrects certain descriptors (for 
the purpose of chemical checking because of ambiguity, 
marked *** in the table); and performs automatic 
resolution in ties involving valid unambiguous descriptors. 
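
The edit described above amounts to an exact-match table lookup with three outcomes. The sketch below is a simplified illustration under stated assumptions: the handful of entries is taken from the table of valid text descriptor terms in this appendix (CIS, TRANS, and MYO carry the *** mark there), while the function name, the outcome strings, and the treatment of the override flag are hypothetical.

```python
# A small excerpt of the descriptor table; True marks an ambiguous (***) term.
DESCRIPTORS = {
    "CIS": True,
    "TRANS": True,
    "MYO": True,
    "EXO": False,
    "GLUCO": False,
    "PREGN": False,
    "YOHIMBAN": False,
}


def edit_descriptor(term, override=False):
    """Classify an input text descriptor by exact match against the table."""
    if term not in DESCRIPTORS:
        # Invalid descriptors reject the input unless an override is coded.
        return "accept (override)" if override else "reject"
    if DESCRIPTORS[term]:
        return "refer for technical review"
    return "accept"


assert edit_descriptor("PREGN") == "accept"
assert edit_descriptor("CIS") == "refer for technical review"
assert edit_descriptor("PREGNA") == "reject"
assert edit_descriptor("PREGNA", override=True) == "accept (override)"
```

The sketch leaves out synonym handling and the automatic resolution of ties, which the source mentions but does not specify.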

Text descriptors have been developed for the 
main body of steroids, terpenes, alkaloids, and 
carbohydrates based upon an alphabetic term or 
base name, representing a basic parent structure 
with implied stereochemistry at specified positions. 
This base name is the name of the parent structure 
or a term closely related to it. 

For steroids, terpenes, and alkaloids, substituents 
attached to the parent at a potentially 
asymmetric atom require a locant or locants preceding 
the base name. This consists of a numerical 
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3 . 



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locant for the substituent^ baseo on the ac- 
cepted numbering for the parent structure, follov/- 
ed by one of three possible letters showing the 
configuration of the attached substituent. The 
alpha configuration, representing a projection 
below the plane of the paper, is shown in the 
illustrative examples by a broken line and represented 
by A as a locant. The beta configuration, representing 
a projection above the plane of the paper, is shown 
by a heavy solid line and represented by B as a locant. 
The unknown configuration is shown in the examples 
by a wavy line and represented by X as a locant. 



Examples are given of a steroid (PREGN), 
terpene (LUPANE), and alkaloid (YOHIMBAN) base 
structure and corresponding base name. Standard 
numbering and stereochemistry are also shown. 
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Exagiples of the use of these text descriptors follow: 



[Structural diagrams not reproduced.] 

3B,20A-YOHIMBAN 
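
A descriptor such as 3B,20A-YOHIMBAN can be split mechanically into its locants and base name. The sketch below assumes the written form seen in that example (comma-separated locants, each a position number followed by A, B, or X, then a hyphen and the base name). The function name and the error handling are hypothetical, and the carbohydrate form of descriptor is outside its scope.

```python
import re

# One locant: a position number followed by A (alpha), B (beta), or X (unknown).
LOCANT = re.compile(r"(\d+)([ABX])")


def parse_descriptor(text):
    """Split a steroid/terpene/alkaloid text descriptor into (locants, base name)."""
    prefix, sep, base = text.rpartition("-")
    locants = []
    if sep:
        for part in prefix.split(","):
            match = LOCANT.fullmatch(part)
            if match is None:
                raise ValueError(f"unrecognized locant: {part!r}")
            locants.append((int(match.group(1)), match.group(2)))
    return locants, base


assert parse_descriptor("3B,20A-YOHIMBAN") == ([(3, "B"), (20, "A")], "YOHIMBAN")
assert parse_descriptor("PREGN") == ([], "PREGN")
```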
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For carbohydrates, text descriptors are based 
primarily on the name of the compound. While the 
text descriptors can be generated from the structure 
only, an adequate name is usually found in connection 
with the structure. The systematic names 
of monosaccharides are derived from the trivial 
names of the sugars themselves, e.g., glucose, mannose, 
ribose. The word roots of these names indicate 
the stereochemistry of some of the hydroxyl groups 
(or derivatives), and the anomeric prefix and 
configurational prefix complete the definition of 
stereochemistry in the compound. For illustration, in 
the name alpha-D-glucopyranose, alpha is the anomeric 
prefix, D the configurational prefix, and gluco the 
word root. The structural diagram is commonly as 
follows: 




[Fischer projection of the sugar not reproduced.] 

Alpha defines the stereoisomerism at C-1, D at C-5, 
and gluco at C-2, C-3, and C-4. Thus the combination 
alpha-D-gluco, or as written for the text descriptor, 
A-D-GLUCO, defines the stereochemistry of the compound. 

4. Improved Handling of Coordination Compounds 

Detailed structuring conventions for coordi- 
nation compounds have been established to assure 
unique identification of each compound and increase 
the level of detail available for substructure 
searching. The structure conventions are designed 
to provide the two important values associated with 
the central atom of the coordination compound: 
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65-I65 (R«v.3/68) 



Coordination Number: the sum of the bond lines 
of attachment to the atom is equal to the 
coordination number. 

Oxidation State (Stock Number, valence): the 
number of charges on the central atom (positive, 
negative, or zero) is equal to the oxidation state. 



In connection with these two points, appropriate 
computer edit of bonds and number (and kind) of 
charges for given elements is provided. For each 
element treated there are allowable charges and 
for each of these there is one or more permitted 
number of bonds. The table (Appendix B) of element 
symbols with permitted charges and corresponding 
bond lines is used in this edit. 

N. B. Whereas "coordination number" and 
"oxidation state" are used for input 
edit and are a part of the machine 
record, these numbers can be 
eliminated as a part of the direct 
structural output being developed 
for the System. 
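
The charge-and-bond edit can be pictured as a two-level table lookup: element symbol, then charge, then the permitted numbers of bond lines. The sketch below uses a few legible rows from the coordination compound table in this appendix (Al, Au, B, Ca). The function name and the wording of the messages are assumptions for illustration, not the System's actual edit.

```python
# Excerpt of the table: symbol -> charge -> permitted numbers of bond lines.
ALLOWED = {
    "Al": {3: {4, 6}},
    "Au": {1: {2}, 3: {4}},
    "B": {3: {4}},
    "Ca": {2: {4, 6}},
}


def check_central_atom(symbol, charge, bond_lines):
    """Return 'ok' or the reason the central atom fails the edit."""
    charges = ALLOWED.get(symbol)
    if charges is None:
        return "element not in table"
    permitted = charges.get(charge)
    if permitted is None:
        return "charge not permitted"
    if bond_lines not in permitted:
        return "number of bond lines not permitted"
    return "ok"


assert check_central_atom("Au", 3, 4) == "ok"
assert check_central_atom("Au", 2, 4) == "charge not permitted"
assert check_central_atom("Al", 3, 5) == "number of bond lines not permitted"
```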

Additional Edit Checks for Abnormal Mass Citations 

Unlike the 7010 system, which did not check 
abnormal mass citations, the 360 system will check 
such citations against a stored list of acceptable 
values for initial mass checking, these being the 
most common cited in chemical and biochemical texts, 
reference works, and abstracts. The elements and 
mass values are given in the table. Unless the input 
structure contains a permissible mass, the 
structure will be rejected. 
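
Applied to a whole structure, the mass check reduces to collecting every cited mass that is not on the stored list. The sketch below uses several rows from the table of acceptable abnormal mass values in this appendix. The representation of a structure as (symbol, mass) pairs, with None for atoms that cite no abnormal mass, and the function name are assumptions.

```python
# Excerpt of the table of acceptable abnormal mass values.
ACCEPTABLE_MASSES = {
    "C": {11, 13, 14},
    "N": {15},
    "O": {17, 18},
    "P": {32},
    "S": {35},
    "Cl": {36, 38},
}


def rejected_mass_citations(atoms):
    """Return the (symbol, mass) citations that are not on the stored list."""
    return [
        (symbol, mass)
        for symbol, mass in atoms
        if mass is not None and mass not in ACCEPTABLE_MASSES.get(symbol, set())
    ]


structure = [("C", 14), ("C", None), ("N", 14), ("O", 18)]
assert rejected_mass_citations(structure) == [("N", 14)]
```

A structure would be accepted only when this list comes back empty.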

Editing Checks for Stock Numbers 

For a selected list of multivalent metals, a 
list of acceptable oxidation states (represented 
by Stock numbers) has been established. The struct- 
ures concerned are the metal salts of acidic com- 
pounds, represented in the structure file as "dis- 
connected" structures. A multivalent metal must 
have an associated Stock number in this situation. 
Monovalent metals do not require a stock number. 

The 360 system will check input Stock numbers 
against a table before accepting a structure. 
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7 , Registration of Larger, More Complex Molecules 



Largely because of the increased speed and 
capacity of the 360 computers, several arbitrary 
limitations of the size and complexity of molecules 
can be removed. 

a) The maximum number of nonhydrogen atoms 
per compound has been increased from 150 
to 25^- for machine registration. 

b) The maximum permissible number of non- 
hydrogen attachments to any one atom 
has been increased from six to 15. 



c) The maximum permissible number of paths 
   traced during unique table generation has 
   been increased from a limit of 10^[exponent illegible] 
   to the number traced in a time limit of 2 minutes 
   and 10 seconds. This will permit highly 
   symmetrical molecules to be registered. 
   This is particularly useful for coordination 
   compounds, in which six, seven, or 
   eight attachments to one metal atom may 
   give rise to a more complex type of symmetry 
   than exists for the usual organic compounds 
   with up to four attachments to carbon atoms. 

III. User Oriented Improvements 

While all improvements to the 360 Registry will improve 
the value, reliability, and speed of the system for 
its users, several changes are specifically user-oriented. 



Structure Match Without Registration 

Structure match against the Registry File will 
be possible without registration. That is, compounds 
can be matched without adding them to the file. This 
will be a significant advantage to users with 
confidential or proprietary compounds. 
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Molecular Formula Processing 



Unlike the 7010 system, which checked the molecular 
formula early in processing and then dropped the 
formula until nearly the end of processing, the 360 
system carries the formula through all structure 
processing steps. This permits the user to make use of 
a portion of the CAS registration processing, that is, 
exit from the system at one of several possible points 
prior to registration, and maintain a complete record. 
This record contains the molecular formula in the 360 
system, whereas it did not in the 7010 system. 



Substructure Search Improvements 



The new system will explicitly record the number 
of hydrogen atoms bonded to each noncarbon atom in a 
compound. (Previously, only the total number of hydrogen 
atoms in the molecule had been recorded.) This 
change will provide greater structural detail for 
substructure searching. 
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Valid Text Descriptor Terms 




ALL                 COA                 EUDESMANE 
ALLO                CONANINE            EXO 
ALLOCINNAMIC        CRINAN              FACET 
ALTRO               CYCLOHEXIMIDE       FOLATE 
AMBRANE             D                   FUROST 
AMBROSANE           DAMMARANE           GALACTO 
ANDROST             DECAMER             GALANTHAMINE 
ANTI                DIELDRIN            GAMBOGIC 
APORPHINE           DIISOTACTIC         GAMMACERANE 
ARABINO             DIMER               GERMACRANE 
ATISANE             DISYNDIOTACTIC      GIBBANE 
BERBINE             DL                  GLIOTOXIN 
BRUCINE             E                   GLUCO 
BUF                 EMETINE             GLYCERO 
CADINANE            ENDO***             GON 
CARD                ENDRIN              GUAIANE 
CATHARANTHINE       EPHEDRINE           GULO 
CEPHALOSPORANIC     EPHENAMINE          HEPTAMER 
CEPHALOSPORIN       EREMOPHILANE        HEXAMER 
CEVANE              ERGOLINE            IDO 
CHLORAMPHENICOL     ERGOST              ISOCHLOROGENIC 
CHLOROGENIC         ERYTHRO             ISOMORPHINAN 
CHOL                ESTR                ISOPULEGONE 
CHOLEST             ETHAMBUTOL          ISOTACTIC 
CIS***              EUCANINE            KAURANE 

*** Ambiguous descriptor requiring technical review. 
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PODOCARPANE 


                                        TETRACYCLINE 
LABDANE             PREGN               TETRAMER 
LANOST              PSEUDOEPHEDRINE     THREO 
LIWAM.              PSEUDOTROPINE       TOMATIDANE 
LUPANE              PULEGONE            TRANS*** 
LYCORAN             QUINIC              TRIMER 
LYSERGIC            QUINIDINE           TROPANE 
LYXO                QUININE             TROPINE 
MANNO               R                   URSANE 
MENTHOL             RIBO                VINCAMINE 
MESO                ROSANE              X 
METARAMINOL         S                   XYLO 
MORPHINAN           SCOPOLAMINE         YOHIMBAN 
MUCO***             SCYLLO***           Z 
MYO***              SECURININE          (+)*** 
NEOCHLOROGENIC      SHIKIMIC            (-)*** 
NEURAMINIC          SIALIC 
NONAMER             SOLANIDANE          * (asterisk) 
NS                  SOLASODANE 
OCTAMER             SPARTEINE 
OLEANANE            SPIROST 
ONOCERANE           STIG 
OXYTOCIN            STRYCHNINE 
PENICILLIN          SYN 
PENTAMER            SYNDIOTACTIC 
PHYLLOCLADANE       TALO 

*** Ambiguous descriptor requiring technical review. 







CHEMICAL ABSTRACTS SERVICE 
System Development 

DOCUMENTATION MANUAL 

NAME   System: 360 Registry 
TOPIC  Charges and Bonds for Coordination Compounds 
ID     System: A015 
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Coordination Compounds 




Element 
Symbol    Charges    Bond Lines 

Ac        3+         6 
Ag        1+         2,4 
          2+         4 
Al        3+         4,6 
Am        3+         6 
As        3+         4 
          5+         6 
Au        1+         2 
          3+         4 
B         3+         4 
Ba        2+         4,6 
Be        2+         4 
Bi        3+         4 
          5+         6 
Bk        3+         6 
Ca        2+         4,6 
Cd        2+         4,6 
Ce        3+         6 
          4+         6 
Cf        3+         6 
Cm        3+         6 
Co        1-         4 
          0          (not legible) 
          1+         (not legible) 
          2+         4,5,6 
          3+         6 
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Element 


Symbol 




Charges 


Bond Lines 


Cr 






0 

2+ 

3-i- 

6+ 


6 

4,6 

6 

8 


Cs 






1+ 


4,6 


Cu 






1-t . 


2,3,4 


By 






3+ 


6 


Er 






3+ 


6 


Es 






3*H 


6 


Eu 






2+ 

3+ 


4 

6 


Fe 


« 




0 

2 + 

3+ 


5.6 

4.6 
4,6 


Fm 






3+ 


6 


Fr 






1+ 


4,6 


Ga 






1-1- 

3_j- 


4,6 

4,6 


Gd 






■ 3+ 


6 


Ge 






2+ 

4+ 


4 

6 


Hf 






4"i- 


6,6 


Hg 






1+ 

2+ 


2,4 

4,6 


Ho 






3+ 


. 6 


In 




1+ 

3-r 


4 

4,6 


Ir 

» 




0 

1- i- 

2- f 


^ 1 
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Element Symbol 


Charges 


Bond Lines 


K 


1 + 


4,6 


1j8« 


3 + 


6 


Li 




4 


Md 


3 + 


6 


Mg 


2 + 


4 


Mn 


0 


6 




1 + 


6 




2 + 

0 ' 
OT 


^, 5,6 

6 




L+ 


6 




6 -{- 


8 




7 + 


8 


Mo 


0 


6 




2-v 




• 


3 + 


6,8 




4 + 


5 ^ 6,8 




5 + 


6,8 




6 + 


8 


Na 


1 + 


4,6 


Kb 


1 - 


6 


• 


2 + 


6 




3 + 


6 




4 + 


6 




5 + 


6 , 7,8 


Nd 


3 + 


6 


Os, 


0 




2 + 


6 




3 + 


6 


• 


4 + 


6 




6 + 


8 




8 + 


9 


P 


3 + 


4 




5 + 


6 


Pb 


• 2 + 


4 


4 + 


6 
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Element Symbol 


Charges 


• Bond Lines 




Pd 


2 >!* 


4 


1 

1 




4 - 1 - 


6 


i 

1 


Pm 


3 + 


6 


1 

1 


Pr 


3 + 


6 


j 




4 -f 


6 




* Pt 


24 - 


4 






4 - 1 - 


6 




Pu 


3 + 


6 


1 




4+ 


6,8 


I 




5 - 1 * 


6,8 


1 


• 


6 - 1 - 


8,10 




Ra 


2-1- 


4,6 


1 

j 


Rb 


14 - 


4,6 




Re 


0 


6 


1 

j 




3 + 




1 

J 




44- 


6 




• 


5 + 


6,8 


? 




64- 


8 






7 + 


8,9 




Rh 


0 


5,6 






1 + 


4,5 






24 - 


5 


S 




3 + 


6 






44 - 


6 




Ru 


0 


5,6 


1 

i 




24 - 


6 


1 




3 + 


6 


1 




44 - 


6 


i 

1 




64- 


8 






7 + 


8 


1 


Sb 


3 + 


4 


1 




5 + 


6,8 
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Element 


Symbol 




Charges 


Bond Lines 


Sc 






3+ 


6 


Se 








6 


Si 






!{.+ 


6 


Sm 






3+ 


6 


Sn 






2 + 

4+ 


4 

6 


Sr 






2 + 


4,6 


Ta 






1 - 

2 + 

3 + 

4-r 

5+ 


6 

6 

6 

6 

6,7,8 


Tb 






3+ 


6 


Tc 






4 + 

7 + 


6 

8 


Te 




4 + 


6 


Th 




3+ 

4+ 


6 

8 


Ti 




2+ 

3+ 

4 + 


6 

6 

6,8 


T 1 




1+ 

3+ 


4 

4,6 


Tm 




3+ 


6 


U 






3+ 

4 + 

§+ 

6+ 


6 

. 6,8 
6,8 
8,10 




J 



I 
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V 



w 



Y 

Yb 

Zr 
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0 

2 + 

3 + 

4 + 

5 + 

2 - 

0 

2 + 

3 + 

4 + 

5 *H 

6 + 

3+ 

2+ 

3+ 

4 -}. 
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6 

6 

5 

6 
4 

6,8 

5 > 6,8 

6,8 

8 



4 

6 



6,8 
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Table of Acceptable Abnormal Mass Values 



Element Symbol    Acceptable Mass Values 

Au                195, 198, 199 
Br                77, 79, 81, 82 
C                 11, 13, 14 
Ca                45, 47 
Cl                36, 38 
Co                56, 57, 58, 60 
Cr                51 
I                 124, 125, 129, 131, 132 
K                 42 
N                 15 
Na                24 
O                 17, 18 
P                 32 
S                 35 
Sr                90 
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Ac 


(III) 




Ag 


(i)> 


(II) 


Am 


(ii)> 


(III) 




(VI) 




Au 


(I), 


(III) 


Bi 


(III) 


, (V) 


Bk 


(III) 


, (IV) 


Ce 


(III) 


, (IV) 


Cf 


(III) 




Cm 


(III) 




Co 




.(III) 


Cr 


(II), 


(III) 


Cu 


(I), 


(II) 


Dy 


(III) 




Er 


(III) 




Es 


(III) 




Eu 


(II), 


(III) 


Fe 


(II), 


(III) 


Pm 


(III) 




Gd 


(III) 




Ge 


(II), 


(IV) 


Hf 


(IV) 




Hg 


(I), 


(II) 


Ho 


(III) 





Table of Acceptable Stock Numbers 

In (I), (III) 



Ir (I), (II), (III), (IV) 



.VI) 
La (III) 
Lu (III) 
Md (III) 



Mn (II), (III), (IV),, (VI), 
(VII) 



Mo 



(III), (IV), (V), 



fi!' 

Nb (II), (III), (IV), (V) 
Kd (III) 

Ni (II), (III) 

Kp (III), (IV), (V), (VI) 



No (III) 



Os (II), (III), (IV), (VI), 
(VIII) 



Pa (IV), (V) 

Pb (II), (IV) 

Pd (II), (IV) 

Pm (III) 

Po (IV) 

Pr (III), (IV) 

Pt (II), (XV) 

pu riij, (III), (IV), (V), 
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Table of Acceptable Stock Numbers 

(Continued) 



Re   (III), (IV), (V), (VI), (VII) 
Rh   (I), (II), (III), (IV), (VI) 
Ru   (II), (III), (IV), (V), (VI), (VIII) 
Sc   (III) 
Sm   (II), (III) 
Sn   (II), (IV) 
Ta   (II), (III), (IV), (V) 
Tb   (III), (IV) 
Tc   (IV), (VII) 
Th   (III), (IV) 
Ti   (II), (III), (IV) 
Tl   (I), (III) 
Tm   (III) 
U    (III), (IV), (V), (VI) 
V    (II), (III), (IV), (V), (VI) 
W    (II), (III), (IV), (V), (VI) 
Y    (III) 
Yb   (II), (III) 
Zr   (IV) 
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A computer-based system of registration is being developed for pol^nners. 
Through this system polymers will be handled as an integral part of the 
Chemical Abstracts Service Registry System. The computer-based recognition 
system will depend on identification of polymer and/or monomer structural 
units. As is the policy for compounds, the polymers which are identified 
and indexed in Chemical Abstracts (CA) will be entered into the Registry. 

CA index nomenclature for polymers has been improved and expanded to keep 
pace with the Registry System. In addition to improved handling in the CA 
subject indexes, polymers will be included in CA formula indexes starting 
with Volume 66 (January-June 1967). The Registry’s files of nomenclature 
(i.e., CA index names, non-CA systematic names, unsystematic names, acronyms, 
etc.) and bibliographic data link the registered material to the published 
source documents in which further information can be obtained. 



NOT FOR PUBLICATION 




SYSTEMS FOR REGISTERING AND NAMING POLYMERS AT 
CHEMICAL ABSTRACTS SERVICE 

Polymers present special problems to those who must organize chemical 
information. Polymers are unlike most compounds, which usually have fairly 
compact molecules whose structure can be readily determined and easily 
characterized in structural diagrams. Instead, polymers normally represent 
large molecules whose structure is difficult to determine, and therefore is 
seldom exactly reported. Moreover, a given polymer sample is characteristically 
a mixture of different structures. This indefiniteness of polymers 
sets them apart from fully defined compounds and leads to the problems 
experienced by persons who must design structure-based systems for organizing, 
naming, and indexing chemical compounds. Since Chemical Abstracts Service 
is active in all of these areas, we have recently been studying these problems 
with an eye toward establishing methods for registering polymers and for 
improving their nomenclature. 



Polymer Registration 

CAS is now in the process of developing a computer-based Chemical Compound
Registry System designed for two purposes (Slide 1): to identify, or recognize,
chemical substances on the basis of their structural characteristics, and to
file the structural and molecular formulas, names, and bibliographic citations
for each substance.

The Registry System is a computer-based identification system which
uniquely identifies chemical structures. The Registry Number is assigned
to each structure when the structure is first entered into the file. Whenever
a structure which is already present on the structure file appears in
a new source, the previously assigned number is recovered automatically. This
Registry Number functions as a machine address within the associated structure,
nomenclature, and bibliographic data files. This information can be correlated
with data furnished in the original source through the CA or other reference.
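
The assign-or-recover behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Registry`, the use of a plain string as the structure key, and the in-memory dictionaries are all hypothetical and are not taken from the CAS programs.

```python
class Registry:
    """Toy registry: one Registry Number per distinct structure key (hypothetical)."""

    def __init__(self, first_number=1):
        self._next = first_number
        self._by_structure = {}   # structure key -> Registry Number
        self.names = {}           # Registry Number -> set of names
        self.citations = {}       # Registry Number -> list of source citations

    def register(self, structure_key, name=None, citation=None):
        number = self._by_structure.get(structure_key)
        if number is None:
            # First appearance of this structure: assign a new number.
            number = self._next
            self._next += 1
            self._by_structure[structure_key] = number
            self.names[number] = set()
            self.citations[number] = []
        # In either case the number addresses the name and citation files.
        if name is not None:
            self.names[number].add(name)
        if citation is not None:
            self.citations[number].append(citation)
        return number
```

Registering the same structure key from a second source returns the number assigned the first time, and the new name and citation are filed under that number.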

The principal file of the Registry System is the Chemical Structure File
(Slide 2), a machine-language listing of the structure in terms of the atoms
and their connections, or bonds. This record, often referred to as a "connection
table," contains all of the information of the two-dimensional projection of
the chemical structure. Additional third-dimension information is handled
by terms called stereochemical descriptors such as endo and exo, plus and
minus. Different stereoisomers are assigned different Registry Numbers.
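
As a rough illustration of what a connection table holds, the following Python sketch lists the atoms and bonds of ethanol. The field layout (`Bond` with `a`, `b`, `order`), the implicit hydrogens, and the separate `descriptors` list are assumptions made for this example; the actual record format of the Chemical Structure File is not given here.

```python
from collections import namedtuple

# Hypothetical record layout: a bond joins atoms a and b with a given order.
Bond = namedtuple("Bond", "a b order")

# Ethanol, CH3-CH2-OH. Atoms are numbered from 1; hydrogens are left implicit.
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [Bond(1, 2, 1), Bond(2, 3, 1)]

# Stereochemical descriptors (for example "endo" or "(+)") would be carried
# alongside the table; ethanol has none.
descriptors = []


def neighbors(atom_number, bonds):
    """Return the sorted numbers of the atoms bonded to atom_number."""
    found = []
    for bond in bonds:
        if bond.a == atom_number:
            found.append(bond.b)
        elif bond.b == atom_number:
            found.append(bond.a)
    return sorted(found)
```

Here `neighbors(2, bonds)` gives the two atoms attached to the central carbon.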

The alternating single and double bonds of cyclic aromatic systems, such as 
benzene rings in the first two structures, are recognized by computer program 
as being equivalent, regardless of the particular resonance structure which 
might be entered. 
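
One simple way to make the two resonance drawings of a benzene ring equivalent is to replace the alternating bond orders by a single "aromatic" marker before comparison. The sketch below assumes the ring is given as a cyclic list of bond orders and handles only six-membered rings; it is a deliberate simplification and not the method used by the CAS program.

```python
def normalize_ring_bonds(orders):
    """Map an alternating single/double six-membered ring to aromatic bonds.

    orders: bond orders taken in sequence around the ring (cyclic list).
    Any other ring is returned unchanged.
    """
    n = len(orders)
    alternating = (
        n == 6
        and set(orders) == {1, 2}
        and all(orders[i] != orders[(i + 1) % n] for i in range(n))
    )
    return ["aromatic"] * n if alternating else list(orders)
```

Both `[1, 2, 1, 2, 1, 2]` and `[2, 1, 2, 1, 2, 1]` normalize to the same list, whereas a ring with a single double bond is left as entered.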

Routine registration began in 1964 and initially included organic compounds
with fully defined two-dimensional structural diagrams, excluding
coordination compounds. All such compounds indexed in Chemical Abstracts
since 1964 have been registered, as have compounds from several other sources,
including reference books and CAS internal files. Some 650,000 different
compounds are now on file.

It is the established goal of CAS to extend this Registry System to
all substances, including coordination compounds, inorganic compounds, mixtures,
and polymers.

As I have noted, the computer recording of polymers presents some
interesting problems that are peculiar to this type of chemical substance.
Polymers are substances made up of recurring structural units, each of which
can be regarded as derived from a specific compound, or monomer. The number
of monomeric units is usually large and variable, a given polymer sample
being characteristically a mixture of structures with different molecular
weights. Thus, polymers are structurally indefinite. I previously described
the Registry System as a computer-based identification system which provides
a unique identification of chemical structures. However, the word "unique"
is equivocal when applied to polymers. Registration of polymers, then, must
involve classification into more or less well defined structural families
rather than recording of unique polymers.

In order to devise a system for registering polymers within the framework
of the established structure-based Registry System, it has been necessary
for CAS to determine the type of information on polymers that is routinely
available from the literature. Some polymers will be described as fully
defined structures, just as most compounds are, while other polymers will
have no structural information given. Between these two extremes lie polymers
that are described in terms of monomers, or the processes used to make the
polymers, or various physical properties, or the significant repeating units.
The amount of each type of information available in the literature [...]
to the design of registration techniques, since it is the goal of [...]
to provide as much differentiation between substances as possible.

Our earliest studies of polymer literature concentrated on the [...]
literature. Members of the research staff at CAS found in an analysis of
the literature on butadiene polymers that definite structures were [...]
2[illegible]% of the 611 polymers reviewed, 65% were described in monomer
terms, and 12% in nonstructural terms (Slide 3). A follow-up survey of entries
from trade and commercial compilations, including drugs, foods, feeds, etc.,
showed that of 2366 polymers found, 21% were described in terms of a significant
repeating unit, 1% in terms of the monomers, and [illegible]% in nonstructural
terms, as shown on Slide 4.

These studies revealed that many polymers appear in the literature in
sufficient detail to permit registration by polymer structure; that is, by
using in various combinations the significant repeating unit (SRU) in a polymer
chain and the constituent monomers. The significant repeating unit is defined
as the smallest group of atoms which on sequential repetition represents the
main structure or backbone of the polymer. Registration by nonstructural
information is also possible. The following classes of polymers represent
the ways in which polymers may be registered:

1a) Polymers described by SRU alone, with no monomer information.
    Example 1 on Slide 5 shows an SRU without end groups,
    and Example 2 on Slide 5 shows an SRU with end groups.

1b) Polymers described by SRU with monomer information.
    See Slide 6.

2) Polymers described only in terms of monomers.
    See examples in Slide 7.

3) Polymers described by non-structural information.
    Slide 8 shows polymers described by names, applications,
    and generic types.
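
The four classes above amount to a small decision rule on what information the source document supplies. A minimal Python sketch follows; the function name and its arguments are hypothetical, chosen only to mirror the list.

```python
def polymer_class(sru=None, monomers=None):
    """Return the registration class ("1a", "1b", "2" or "3") for a polymer.

    sru:      the significant repeating unit, if the source gives one
    monomers: the constituent monomers, if the source gives them
    """
    if sru:
        # An SRU is available, with or without end groups.
        return "1b" if monomers else "1a"
    if monomers:
        return "2"
    # Only names, applications, or generic types are available.
    return "3"
```

For example, a polymer reported only as "polymer of CH2:CHCN" falls in class 2, while a trade-named resin with no structural data falls in class 3.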

The first class of polymers, SRU's without end groups, will be registered
in terms of atoms and connections for machine representation with open-end
bonds, i.e., bonds represented, but with no attachments present. Each atom
of the structure will be marked as being in a repeated group which has a
repeating factor of n. Structures of this type must have two and only two
open bonds and a repeating factor of n for acceptance by the computer Edit
Program which checks the validity of structural diagrams.
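
The acceptance conditions just stated (exactly two open bonds, every atom marked as repeated, repeating factor n) can be expressed as a short validity check. The Python below is a sketch only; the record layout and the function name `edit_check_sru` are assumptions and do not describe the actual Edit Program.

```python
def edit_check_sru(atoms, open_bonds, repeating_factor):
    """Return a list of error messages; an empty list means the SRU is accepted.

    atoms:            list of dicts such as {"symbol": "C", "repeated": True}
    open_bonds:       atom numbers (from 1) that carry an open-end bond
    repeating_factor: the factor attached to the repeated group
    """
    errors = []
    if len(open_bonds) != 2:
        errors.append(f"expected exactly 2 open bonds, found {len(open_bonds)}")
    if repeating_factor != "n":
        errors.append("repeating factor must be n")
    unmarked = [i for i, atom in enumerate(atoms, start=1) if not atom.get("repeated")]
    if unmarked:
        errors.append(f"atoms not marked as repeated: {unmarked}")
    return errors
```

A unit with three open bonds, such as one from a ladder or three-dimensional polymer, would be rejected by this check.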

Structure handling of ladder polymers, three-dimensional polymers, and
similar types of polymers has not yet been established. That is, studies
to develop methods of registering polymers with repeating units which have
more than two open bonds, with or without stated end groups, are still in
progress.

Since a repeating unit in a polymer may be written in more than one
form as shown in Slide 9, the polymer will be treated during computer editing
as if the two open bonds were joined to form a ring. This will assure
the generation of identical unique connection tables for the same polymer,
even though it can be represented by more than one structure drawing.
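
The effect of closing the two open bonds into a ring can be illustrated by treating the backbone as a cyclic sequence and choosing one fixed representative among all of its rotations and reversals. The Python sketch below assumes the backbone is given as a simple list of symmetric units (for example single backbone atoms or CH2 groups); it is not the canonicalization procedure of the Registry programs.

```python
def canonical_sru(backbone):
    """Return one fixed representative for every way of writing the same SRU.

    backbone: sequence of backbone units, read from one open bond to the other.
    Joining the open bonds makes the sequence cyclic, so all rotations and
    both reading directions describe the same polymer.
    """
    units = tuple(backbone)
    if not units:
        raise ValueError("backbone must contain at least one unit")
    candidates = []
    for seq in (units, units[::-1]):
        for i in range(len(seq)):
            candidates.append(seq[i:] + seq[:i])
    # The lexicographically smallest rotation serves as the canonical form.
    return min(candidates)
```

Written as -O-CH2-CH2-, -CH2-O-CH2-, or -CH2-CH2-O-, the unit reduces to the same tuple, so all three drawings would lead to one record.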

Polymers described in terms of an SRU with end groups including hydrogen
will be input by similar procedures. Each atom of the repeating unit will
be marked as being in a repeated group with n as the repeating factor. During
the computer edit process the end bonds of the SRU will be treated as if they
were joined to form a ring, and the end groups will be treated as disconnected
fragments. However, the original attachments to the end groups will be available
in the computer record. This method is used to simplify substructure
searching in polymers. Cross references will be automatically generated for
all members of a set of polymers with the same SRU but different end groups.
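
Generating these cross references is essentially a grouping of registered polymers by their SRU. A minimal Python sketch, assuming each entry is a hypothetical `(registry_number, sru_key, end_groups)` tuple:

```python
from collections import defaultdict


def cross_references(entries):
    """Map each Registry Number to the other numbers that share its SRU.

    entries: iterable of (registry_number, sru_key, end_groups) tuples.
    """
    by_sru = defaultdict(list)
    for number, sru_key, _end_groups in entries:
        by_sru[sru_key].append(number)
    refs = {}
    for numbers in by_sru.values():
        for number in numbers:
            refs[number] = sorted(n for n in numbers if n != number)
    return refs
```

Polymers that share no SRU with any other entry simply receive an empty list.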

[ 

A significant repeating unit furnished with monomer information is
registered in the same way as that unit without monomer data. For example,
Slide 10 illustrates poly(ethylene terephthalate) expressed as a significant
repeating unit with no monomer information given, as prepared from disodium
terephthalate and ethylene dibromide, and as prepared from terephthaloyl
chloride and ethylene glycol. The unit itself is considered the same and
receives the same Registry Number in each case.

Polymers of the second class, for which only monomer information is
supplied, receive different Registry Numbers for each constituent monomer,
and appropriate cross-references between polymers and monomers will be
supplied.

The third type of polymers, those with no structural information but
described by type or application (Slide 8 repeat), will be registered by
name, as assigned by the author. Commonly recognized synonymous names will
be given the same Registry Number; otherwise, polymers of this class with
different names are considered different and will receive different Registry
Numbers.

Criteria for Differentiating Polymers 

In developing these techniques for registering polymers, it has been
necessary to establish standards as to what is the same and what is different
concerning polymers. Polymers made up of structurally isomeric significant
repeating units as illustrated in Slide 11 are considered different, and will
receive different Registry Numbers. Polymers with the same significant
repeating unit and different end groups are considered different; for example,
the two structures in Slide 12 will receive different Registry Numbers. And
stereoisomeric forms of polymers, when the stereoisomerism is identified in
the source document, are considered different and will receive different
Registry Numbers.

On the other hand, some polymer characteristics, which are regarded
as inconsistent and imprecise for registration purposes, will not be used
to differentiate between polymers. These characteristics include physical
properties (molecular weight, melting point, viscosity, solubility, color,
density, etc.); reaction conditions (pressure, temperature, and solvency);
and the ratio of monomers involved in a copolymer (either the ratio of
monomers actually incorporated in the polymer or the ratio of charged
monomers). Polymers differing only in one or more of these characteristics
will be considered the same, and will receive the same Registry Number.
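
These criteria can be summarized as a registration key built only from the differentiating characteristics. The Python sketch below covers polymers registered by SRU; the record fields (`sru`, `end_groups`, `stereo`, and the ignored ones) are hypothetical names introduced for illustration.

```python
def registration_key(record):
    """Build the key that decides whether two SRU-registered polymers are the same.

    Only the SRU, the end groups, and any reported stereoisomerism take part.
    Physical properties, reaction conditions, and monomer ratios are ignored.
    """
    return (
        record["sru"],
        tuple(sorted(record.get("end_groups", ()))),
        record.get("stereo"),
    )
```

Two records with equal keys would receive the same Registry Number, however much their reported molecular weights or preparation conditions differ.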

"Post-reacted" or "after-treated" polymers will be handled in one of 
two ways. When the new polymer formed by the post reaction is clearly defined
by structure or name, it will be recorded as such and will receive a polymer
Registry Number different from that of the original polymer; but, when the
new polymer is described in general chemical terms such as oxidized,
brominated, etc., or in structural terms, block, graft, cross-linked, etc., it
will be recorded as the original polymer and will receive the same polymer
Registry Number as that of the original polymer.

Slide 8 illustrates this technique. A polymer represented as the significant
repeating unit (-CH2CH=CHCH2-)n may be stated to be oxidized to form
a polymer made up of the unit (-CH2COCH2CH2-)n. In such a case the latter
is registered by structure and is different from the original. However, if
the unit (-CH2CH=CHCH2-)n is stated simply to be oxidized, the original
unit is registered and the oxidized polymer recorded as the original one.
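
The two-way rule for post-reacted polymers can be written as a small function. In this Python sketch the `register` callable, the dictionary standing in for the structure file, and the key strings are all assumptions made for illustration.

```python
def number_after_treatment(original_number, register, structure=None, name=None):
    """Return the Registry Number to record for a post-reacted polymer.

    register:  callable that assigns or recovers a number for a structure or name
    structure: the product structure, if the source defines it clearly
    name:      the product name, if the source defines it clearly
    """
    if structure is not None:
        return register(structure)
    if name is not None:
        return register(name)
    # Described only in general terms ("oxidized", "graft", ...):
    # record the product as the original polymer.
    return original_number


# Hypothetical stand-in for the registry files.
numbers = {"original-sru-key": 1}


def register(key):
    return numbers.setdefault(key, len(numbers) + 1)
```

With this stand-in, a product given with its own structure receives a new number, while a product described only as "oxidized" keeps number 1.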
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have had very poor acceptance and are not widely used. Other proposals for 
naming polymers have been made, but none has received the wide acceptance
and general usage necessary for a nomenclature system. As a result, most
polymers are described by trivial or trade name systems or in terms of
the chemical monomer(s) that are used in the preparation of the polymer.

Such systems have serious limitations. In most cases, important structural
characteristics cannot be described adequately through this type of designation.
Furthermore, the same family of polymer compounds may be prepared from a
variety of monomers, resulting in wide scattering of information in a
compilation such as the CA indexes.

For these reasons, the Nomenclature Committee of the Polymer Division
of the American Chemical Society has developed a structure-based system for
naming polymers that are described in terms of a known, regularly repeating
structural unit. When the structure of the repeating unit is unknown, the
polymer is described in much the same fashion now used, i.e., by trivial or
trade names, or on the basis of the chemical monomers. This system is being
evaluated and tested during the preparation of the CA Indexes for Volume 66.

In essence, this structure-based nomenclature system is based on the
fundamental structural unit or mer, which is defined as the smallest unit
(real or hypothetical) of a polymer. The mer is made up of one or more
polyvalent radicals which, when named as polyvalent organic radicals and cited
in the directional order specified in the rules, make up the name of the mer.
The repeating structural unit of a linear polymer may be composed of a unit
that can be expressed in terms of just one polyvalent radical such as methylene
or phenylene. Often the mer is more complex and is composed of a series of
polyvalent radicals.

The names of linear polymers containing polyvalent radicals are formed
by citing the names of the radicals consecutively in the directional order
specified in the rules. The entire series of polyvalent radicals is prefixed
by the term "poly" with suitable enclosing marks. Substituents on the radicals
are also denoted by appropriate prefixes. Some examples of names of linear
polymers of known structure are shown on Slides 14 and 15.
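
The formation of such a name can be sketched as a simple concatenation. The Python below assumes the radical names are already in the order required by the rules and already carry any substituent prefixes; it does not implement the ordering rules themselves, and the radicals "oxy" and "ethylene" in the usage note are illustrative choices that do not come from the slides.

```python
def structure_based_name(radicals):
    """Form a polymer name by citing the radical names consecutively.

    radicals: names of the polyvalent radicals of the mer, already in the
              directional order required by the nomenclature rules.
    """
    if not radicals:
        raise ValueError("at least one radical name is required")
    return "poly(" + "".join(radicals) + ")"
```

A mer consisting of the single radical methylene gives "poly(methylene)", and the series oxy, ethylene gives "poly(oxyethylene)".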

Polymers of unspecified structure are indexed essentially in terms of
the monomers. For example, a homopolymer such as poly(1-hexene) whose
structure is not described is indexed as "1-hexene, polymer." A copolymer
such as a 1-hexene-1-heptene copolymer is indexed as "1-heptene, polymer
with 1-hexene" and also as "1-hexene, polymer with 1-heptene."
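
The monomer-based index entries follow a regular pattern, one entry per monomer. A minimal Python sketch is given below; joining three or more comonomers with "and" is an assumption of this example, since only the one- and two-monomer cases are described in the text.

```python
def monomer_index_entries(monomers):
    """Return the index entries for a polymer of unspecified structure."""
    names = sorted(set(monomers))
    if not names:
        raise ValueError("at least one monomer is required")
    if len(names) == 1:
        return [f"{names[0]}, polymer"]
    entries = []
    for name in names:
        others = [other for other in names if other != name]
        # Assumption: several comonomers are joined with "and".
        entries.append(f"{name}, polymer with {' and '.join(others)}")
    return entries
```

Each copolymer is thus reachable in the index under every one of its monomers.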

At present, CAS is conducting development tests of both the registration
and naming techniques for polymers. Some polymers are now being registered
on an experimental basis by manual procedures that precisely simulate the
computer procedures discussed here. These polymers appear in selected
sources supplied by the National Library of Medicine and the Food and Drug
Administration under Contract C-414 between CAS and the National Science
Foundation. Through these procedures, such polymer information as names,
Registry Numbers, molecular formulas, and the appropriate source indicator
are machine recorded, but structural information is not.

Computer programming for structural information will be undertaken this
autumn, and the registration of polymer structures should be possible in the
spring of 1968. At that time the polymers selected as entries in the CA
indexes for Volume 66 (July-December of 1967) are scheduled to be input to
the system. Polymer entries in succeeding CA volume indexes will be registered
on a routine basis, as will polymers from the CAS express publications POST-J
and POST-P.




The structure-based nomenclature system for polymers is being evaluated
and tested during the preparation of the CA indexes for Volume 66 and will also
be used, with possible revisions and expansion, in succeeding volumes.

RESULTS OF BUTADIENE POLYMER STUDY



structure (SRU)

RESULTS OF POLYMER SURVEY IN TRADE AND
COMMERCIAL COMPILATION



structure (SRU)

POLYMERS DESCRIBED BY SRU AND MONOMER INFORMATION

(-CH2-CH-CH2-CH-)n
      |      |
      CO2Et  CN

from CH2:CH-CO2Et and CH2:CH-CN
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POLYMERS DESCRIBED BY MONOMERS ALONE 



1. polymer of CH2:C(CH3)CH:CH2

2. polymer of CH2:CHCN, PhCH:CH2, and CH2:CHCH:CH2

POLYMERS DESCRIBED BY NON-STRUCTURAL INFORMATION

ALTERNATIVE FORMS OF POLYMER REPEATING UNIT

(-CH2-CH2-O-CO-C6H4-CO-O-)n

DESCRIPTIONS OF POLY(ETHYLENE TEREPHTHALATE)



1. SRU with no monomer information

   (-CO-C6H4-CO-OCH2CH2O-)n

2. SRU with monomer information

   from HO2C-C6H4-CO2H disodium salt
   and BrCH2CH2Br

3. SRU with monomer information

   from ClOC-C6H4-COCl
   and HOCH2CH2OH

POLYMERS WITH STRUCTURALLY ISOMERIC SRU's

POLYMERS WITH THE SAME SRU AND DIFFERENT END GROUPS

POST-REACTED POLYMER

(-CH2CH:CHCH2-)n
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LINEAR POLBIERS COMPOSED OF SIMPLE BIVALENT lOTS 
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LINEAR POLYMERS COMPOSED OF COMPLEX 
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